On Sun, Aug 6, 2017 at 1:06 PM, Peter Geoghegan <p...@bowt.ie> wrote: > On Sat, Aug 5, 2017 at 8:26 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: >> I'm quite disturbed though that the set of installed collations on these >> two test cases seem to be entirely different both from each other and from >> what you reported. The base collations look generally similar, but the >> "keyword variant" versions are not comparable at all. Considering that >> the entire reason we are interested in ICU in the first place is its >> alleged cross-version collation behavior stability, this gives me the >> exact opposite of a warm fuzzy feeling. We need to understand why it's >> like that and what we can do to reduce the variation, or else we're just >> buying our users enormous future pain. At least with the libc collations, >> you can expect that if you have en_US.utf8 available today you will >> probably still have en_US.utf8 available tomorrow. I am not seeing any >> reason to believe that the same holds for ICU collations. > > +1. That seems like something that is important to get right up-front.
I've looked into this. I'll give an example of what keyword variants there are for Greek, and then discuss what I think each is. These keyword variant locations on my machine with master + ICU support (ICU 55): postgres=# \dOS+ el-* List of collations Schema │ Name │ Collate │ Ctype │ Provider │ Description ────────────┼────────────────────────┼──────────────────┼──────────────────┼──────────┼───────────── pg_catalog │ el-u-co-emoji-x-icu │ el-u-co-emoji │ el-u-co-emoji │ icu │ Greek pg_catalog │ el-u-co-eor-x-icu │ el-u-co-eor │ el-u-co-eor │ icu │ Greek pg_catalog │ el-u-co-search-x-icu │ el-u-co-search │ el-u-co-search │ icu │ Greek pg_catalog │ el-u-co-standard-x-icu │ el-u-co-standard │ el-u-co-standard │ icu │ Greek pg_catalog │ el-x-icu │ el │ el │ icu │ Greek (5 rows) Greek has only one region, standard Greek. A few other language-regions have variations like multiple regions (e.g. Austrian German), or a phonebook variant, which you don't see here. Almost all have -emoji, -search, and -standard, which you do see here. We pass "commonlyUsed = true" to ucol_getKeywordValuesForLocale() within pg_import_system_collations(), and so it "will return only commonly used values with the given locale in preferred order". But should we go even further? If the charter of pg_import_system_collations() is to import every possible valid collation for pg_collation, then it's already failing at that by limiting itself to "common variants". I agree with the decision to do that, though, and I think we probably need to go a bit further. Possible issues with current ICU pg_collation entries after initdb: * I don't think we should have user-visible "search" collations at all. Apparently "search" collations are useful because "primary- and secondary-level distinctions for searching may not be the same as those for sorting; in ICU, many languages provide a special "search" collator with the appropriate level settings for search" [1]. I don't think that we should expose "search" keyword variants at all, because clearly they're an implementation detail that Postgres may one day have special knowledge of [2], to correctly mix searching and sorting semantics. For the time being, those should simply not be added within pg_import_system_collations(). Someone could still create the entries themselves, which seems harmless. Let's avoid establishing the expectation that they'll be in pg_collation. * Redundant ICU spellings for the same collation seem to appear. I find it questionable that there is both a "el-x-icu" and a "el-u-co-standard-x-icu". That looks like an artifact of how pg_import_system_collations() was written, as opposed to a bonafide behavioral difference. I cannot find an example of a "$COUNTRY_CODE-x-icu" collation without a corresponding "$COUNTRY_CODE-*-u-standard-x-icu" (The situation is similar for regional variants, like Austrian German). What, if anything, is the difference between each such pair of collations? Can we find a way to provide only one canonical entry if those are simply different ICU spellings? * Many emoji variant collations. I have to wonder if there is much value in creating so many pg_collation entries that are mere variants to do pictographic emoji sorting. Call me a killjoy, but I think that users that want that behavior can create the collations themselves. We could still document it. I wouldn't mind it if there wasn't so many emoji collations. * Many EOR variant collations. EOR as a collation variant is an ICU hack to get around the fact that EOR doesn't fit with their taxonomy for locales. My understanding is that there is supposed to be one EOR collation, used across Europe, per the ISO standard. I think ICU structures it as a variant because ICU only provides collations through locales, and collation is only one property of a locale. EOR has no opinion about what a currency sign should look like, unlike an ICU locale. Maybe we should only have one EOR collation unless the user creates one of their own. We only care about distinct collation behavior, at least as far as ICU knows. [1] http://userguide.icu-project.org/collation/icu-string-search-service [2] http://www.unicode.org/reports/tr35/#UnicodeCollationIdentifier -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers