On Fri, Sep 22, 2017 at 11:34 AM, Peter Eisentraut <peter.eisentr...@2ndquadrant.com> wrote: > After reviewing this thread and testing around a bit, I think the code > is mostly fine as it is, but the documentation is lacking. Hence, > attached is a patch to expand the documentation quite a bit, especially > to document in more detail what ICU locale strings are accepted. > > The documentation has always stated, albeit perhaps in obscure ways, > that we accept for locales whatever ICU accepts. I don't think we > should restrict or override that in any way. That would cause existing > documentation and lore on ICU to be invalid for PostgreSQL.
I think that this thread is mostly about three fairly different things. I didn't plan it that way, but then I only realized the second two while investigating the first. Those are: 1. The question of whether and to what extent we should sanitize the colcollate string provided by the user within CREATE COLLATION for the ICU collation provider. 2. The question of what ends up in pg_collation at initdb time. In particular, the format of colcollate. 3. The question of whether or not we should ever accept a locale in the traditional format, rather than insisting on BCP 47 in every context. (This may have become conflated with issue 1.) > I specifically disagree that we should, as appears to be suggested here, > restrict the input locale strings to BCP 47 and reject or transform the > traditional ICU-specific locale syntax. Again, that would cause > existing ICU documentation material to become less applicable to > PostgreSQL. And basically, there is no reason for it, because I am not > aware that ICU plans to get rid of that syntax. I disagree, because ICU/CLDR seems to want to standardize on BCP 47 (with custom extensions), but I acknowledge that you have a point here. This isn't what I think is important, among all the points raised on this thread. I can let this go. > Moreover, we need to > support that syntax for older ICU versions anyway. FWIW, I don't think that we *need* to support it for older ICU versions, except as an implementation detail that gets us to and from BCP 47 as needed. > A patch has been > posted that, as I understand it, would allow BCP 47 syntax to be used > with older versions as well. That's a nice idea, but it's a new feature, > which may have been my fault, which may have been my fault > and not appropriate for PG10 at this time. > > (Note that we also don't canonicalize libc locale names.) But you are *already* canonicalizing ICU collation names as BCP 47. My point here is: Why not finish the job off, and *also* canonicalize colcollate in the same way? This won't break ucol_open() if we take appropriate precautions when we go to use the Postgres collation/ICU locale. One concern that makes me suggest this is: What happens when the user *downgrades* ICU version, from a version where colcollate is BCP 47 to one where it would not have been at initdb time? That will break the downgrade in an unpleasant way, including in installations that never do a CREATE COLLATION themselves. We want to be able to restore a basebackup on a somewhat different OS, and have that still work following REINDEX. At least, that seems like it should be an important goal for us. I want Postgres to insist on always using BCP 47 in CREATE COLLATION for a few reasons. One that is relevant to this colcollate question is that by insisting on BCP 47 within CREATE COLLATION, there is no question of CREATE COLLATION having to consider the legacy locale format on ICU versions that don't handle both at the same time too well (this at least includes ICU 42). We'd always only accept BCP 47 from users, we'd always store BCP 47 as colcollate (and collation name), and we'd always use the traditional locale string format as an argument to ucol_open(). Consistency of interface and implementation, across all ICU versions. I might actually be convinced by what you say here if we weren't already canonicalizing collation name as BCP 47 (though I also understand why you did that). > During testing with various versions I have also discovered the > following things: > > - The current code appears to be of the opinion that BCP 47 locale names > are accepted as of ICU 54. That appears to be incorrect; I find that > they work fine in ICU 52, but they don't work in ICU 4.2. I don't know > where the cutoff is. That might be worth changing in the code if we can > obtain more information. I fear that BCP 47 is only quasi supported (without the proper conversion by uloc_forLanguageTag()) prior to ICU 54 (the version where it is actually documented as supported). We've already seen plenty of cases where ucol_open() locale string interpretation soldiers on, when it arguably shouldn't, so I hope that that isn't what you actually see on ICU 52. I strongly suggest looking at a variety of display names at CREATE COLLATION time (I can provide a rough patch that shows display name [1]), to make sure that it matches expectations on all ICU versions. Your testing patch does not satisfy me that versions prior to ICU 54 have a ucol_open() that accepts BCP 47 without *any* problem (it could be a subtle issue). Besides, if it isn't going to work on ICU 4.2 anyway, I think that the leniency of ICU 52 isn't actually interesting. We still have an awkward special case to deal with. > One conclusion here is that there are multiple dimensions to what sort > of thing is accepted as a locale in different versions of ICU and what > the canonical format is. If we insist that everything has to be written > in the form that is preferred today, then we'll have awkward problems if > a future version of ICU establishes even more variants that will then be > preferred. I'd feel a lot better about that if there was a NOTICE showing the ICU display name for the locale shown when the user does a CREATE COLLATION. I think that the risks from not doing that outweigh the risks of doing it, even at this late date. I could live without any sanitization if we did this much. Without this, the user is flying blind when using CREATE COLLATION. Why should the DBA be able to verify that the sorting they get is culturally appropriate? That's just not practical. > I think there is room for incremental refinement here. Features like > optionally checking or printing the canonical or preferred locale format > or making the locale description available via a function or system view > might all be valuable additions to a future PostgreSQL release. I don't think that there is room for incremental refinement when it comes to what ends up in pg_collation at initdb time. I don't like to create commotion at this late date, but there are some things that we'll only really get one chance at here. Admittedly, that's not true of everything I raise here; it's hard to strictly limit discussion to issues that have that quality. [1] postgr.es/m/cah2-wznopmj+3xh6bvea_yuyd4zdgiwg9yce31q09ou3xxw...@mail.gmail.com -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers