On Sun, Sep 24, 2017 at 9:24 PM, Peter Geoghegan <p...@bowt.ie> wrote: > * Documents the aforementioned keyword collation attribute restriction > on ICU versions before ICU 54. This was needed anyway. We only claim > for Postgres collations what the ICU docs claim for ICU collators, > even though there is reason to believe that some ICU versions before > ICU 54 actually can do better.
Following some digging, it looks like the restriction actually only needs to apply on versions prior to ICU 4.6. The internal function _canonicalize() contains the functionality need for Postgres to avoid converting to BCP 47 itself on the fly (this was previously believed to be the case only in ICU 54): https://ssl.icu-project.org/trac/browser/tags/release-50-1-2/icu4c/source/common/uloc.cpp#L1614 (This is uloc.c on earlier ICU versions.) (The call stack here is that from ucol_open(), ucol_open_internal() is called, which calls uloc_getBaseName(), which calls _canonicalize().) ICU gained that _hasBCP47Extension() branch in ICU 46. The _canonicalize function has hardly changed since, despite the rewrite from C to C++ (I checked up until ICU 55, the reasonably current version I've done most testing on). This is totally consistent with what both Peter and I have observed, even though the ucol_open() docs suggest that that stuff only became available within ICU 54. I think that ucol_open() was updated to mention BCP 47 at the same time as it mentioned the lifting of the restrictions on extended keywords (when you no longer had to use ucol_setAttribute()), confusing two basically unrelated issues. I suggest that as a compromise, my proposal can be changed in either of the following ways: * Do the same thing as I propose, except only introduce the mapping from/to language tags when ICU version < 46 is in use. This would mean that we'd be doing that far less in practice. Or: * Never do *any* mapping/language tag conversion when opening a collator, but still consistently canonicalize as BCP 47 on all ICU versions. This can be done by only supporting versions that have the required _hasBCP47Extension() branch (ICU >= 46). We'd desupport the old ICU 42 and ICU 44 versions. Support for ICU 42 and ICU 44 came fairly late, in commit eccead9 on August 5th. Both Tom and I expressed reservations about that at the time. It doesn't seem too late to desupport those 2 versions, leaving us with a fix that is even less invasive. We actually wouldn't technically be changing anything about canonicalization at all, this way. We'd be changing our understanding of when a restriction applies (not < 54, < 46), at which point the "collcollate = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name" thing that I take particular issue with becomes "collcollate = U_ICU_VERSION_MAJOR_NUM >= 46 ? langtag : name". This is equivalent to "collcollate = langtag" (which is what it would actually look like), since we're desupporting earlier versions. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers