[HACKERS] ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

Peter Geoghegan Sun, 06 Aug 2017 17:09:07 -0700

On Sun, Aug 6, 2017 at 1:06 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Sat, Aug 5, 2017 at 8:26 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> I'm quite disturbed though that the set of installed collations on these
>> two test cases seem to be entirely different both from each other and from
>> what you reported.  The base collations look generally similar, but the
>> "keyword variant" versions are not comparable at all.  Considering that
>> the entire reason we are interested in ICU in the first place is its
>> alleged cross-version collation behavior stability, this gives me the
>> exact opposite of a warm fuzzy feeling.  We need to understand why it's
>> like that and what we can do to reduce the variation, or else we're just
>> buying our users enormous future pain.  At least with the libc collations,
>> you can expect that if you have en_US.utf8 available today you will
>> probably still have en_US.utf8 available tomorrow.  I am not seeing any
>> reason to believe that the same holds for ICU collations.
>
> +1. That seems like something that is important to get right up-front.

I've looked into this. I'll give an example of what keyword variants
there are for Greek, and then discuss what I think each is. These
keyword variant locations on my machine with master + ICU support (ICU
55):

postgres=# \dOS+ el-*
List of collations
Schema │ Name │ Collate │ Ctype
│ Provider │ Description
────────────┼────────────────────────┼──────────────────┼──────────────────┼──────────┼─────────────
pg_catalog │ el-u-co-emoji-x-icu │ el-u-co-emoji │
el-u-co-emoji │ icu │ Greek
pg_catalog │ el-u-co-eor-x-icu │ el-u-co-eor │ el-u-co-eor
│ icu │ Greek
pg_catalog │ el-u-co-search-x-icu │ el-u-co-search │
el-u-co-search │ icu │ Greek
pg_catalog │ el-u-co-standard-x-icu │ el-u-co-standard │
el-u-co-standard │ icu │ Greek
pg_catalog │ el-x-icu │ el │ el
│ icu │ Greek
(5 rows)

Greek has only one region, standard Greek. A few other
language-regions have variations like multiple regions (e.g. Austrian
German), or a phonebook variant, which you don't see here. Almost all
have -emoji, -search, and -standard, which you do see here.

We pass "commonlyUsed = true" to ucol_getKeywordValuesForLocale()
within pg_import_system_collations(), and so it "will return only
commonly used values with the given locale in preferred order". But
should we go even further? If the charter of
pg_import_system_collations() is to import every possible valid
collation for pg_collation, then it's already failing at that by
limiting itself to "common variants". I agree with the decision to do
that, though, and I think we probably need to go a bit further.

Possible issues with current ICU pg_collation entries after initdb:

* I don't think we should have user-visible "search" collations at all.

Apparently "search" collations are useful because "primary- and
secondary-level distinctions for searching may not be the same as
those for sorting; in ICU, many languages provide a special "search"
collator with the appropriate level settings for search" [1]. I don't
think that we should expose "search" keyword variants at all, because
clearly they're an implementation detail that Postgres may one day
have special knowledge of [2], to correctly mix searching and sorting
semantics. For the time being, those should simply not be added within
pg_import_system_collations(). Someone could still create the entries
themselves, which seems harmless. Let's avoid establishing the
expectation that they'll be in pg_collation.

* Redundant ICU spellings for the same collation seem to appear.

I find it questionable that there is both a "el-x-icu" and a
"el-u-co-standard-x-icu". That looks like an artifact of how
pg_import_system_collations() was written, as opposed to a bonafide
behavioral difference. I cannot find an example of a
"$COUNTRY_CODE-x-icu" collation without a corresponding
"$COUNTRY_CODE-*-u-standard-x-icu" (The situation is similar for
regional variants, like Austrian German). What, if anything, is the
difference between each such pair of collations? Can we find a way to
provide only one canonical entry if those are simply different ICU
spellings?

* Many emoji variant collations.

I have to wonder if there is much value in creating so many
pg_collation entries that are mere variants to do pictographic emoji
sorting. Call me a killjoy, but I think that users that want that
behavior can create the collations themselves. We could still document
it. I wouldn't mind it if there wasn't so many emoji collations.

* Many EOR variant collations.

EOR as a collation variant is an ICU hack to get around the fact that
EOR doesn't fit with their taxonomy for locales. My understanding is
that there is supposed to be one EOR collation, used across Europe,
per the ISO standard. I think ICU structures it as a variant because
ICU only provides collations through locales, and collation is only
one property of a locale. EOR has no opinion about what a currency
sign should look like, unlike an ICU locale.

Maybe we should only have one EOR collation unless the user creates
one of their own. We only care about distinct collation behavior, at
least as far as ICU knows.

[1] http://userguide.icu-project.org/collation/icu-string-search-service
[2] http://www.unicode.org/reports/tr35/#UnicodeCollationIdentifier
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

Reply via email to