Re: ICU locale validation / canonicalization

2023-07-07 Thread Jeff Davis
On Sat, 2023-07-01 at 10:31 -0700, Noah Misch wrote: > As of commit b8c3f6d, InstallCheck-C got daticulocale=en-US-u-va- > posix.  Check > got daticulocale=NULL. With the same test setup, that locale takes about 8.6 seconds (opening it 10M times), about 2.5X slower than "en-US" and about 7X

Re: ICU locale validation / canonicalization

2023-07-01 Thread Noah Misch
On Sat, May 20, 2023 at 10:19:30AM -0700, Jeff Davis wrote: > On Tue, 2023-05-02 at 07:29 -0700, Noah Misch wrote: > > On Thu, Mar 30, 2023 at 08:59:41AM +0200, Peter Eisentraut wrote: > > > On 30.03.23 04:33, Jeff Davis wrote: > > > > Attached is a new version of the final patch, which performs >

Re: ICU locale validation / canonicalization

2023-05-20 Thread Jeff Davis
On Tue, 2023-05-02 at 07:29 -0700, Noah Misch wrote: > On Thu, Mar 30, 2023 at 08:59:41AM +0200, Peter Eisentraut wrote: > > On 30.03.23 04:33, Jeff Davis wrote: > > > Attached is a new version of the final patch, which performs > > > canonicalization. I'm not 100% sure that it's wanted, but it >

Re: ICU locale validation / canonicalization

2023-05-02 Thread Noah Misch
On Thu, Mar 30, 2023 at 08:59:41AM +0200, Peter Eisentraut wrote: > On 30.03.23 04:33, Jeff Davis wrote: > >Attached is a new version of the final patch, which performs > >canonicalization. I'm not 100% sure that it's wanted, but it still > >seems like a good idea to get the locales into a

Re: ICU locale validation / canonicalization

2023-04-04 Thread Thomas Munro
MSVC now says this on master: [17:48:12.446] c:\cirrus\src\backend\utils\adt\pg_locale.c(2912) : warning C4715: 'icu_language_tag': not all control paths return a value CI doesn't currently fail for MSVC warnings, so it's a bit hidden. FWIW cfbot does show this with a ⚠ sign with its new system

Re: ICU locale validation / canonicalization

2023-04-04 Thread Peter Eisentraut
On 31.03.23 12:11, Jeff Davis wrote: On Thu, 2023-03-30 at 08:59 +0200, Peter Eisentraut wrote: I don't think we should show the notice when the canonicalization doesn't change anything.  This is not useful: +NOTICE:  using language tag "und-u-kf-upper" for locale "und-u-kf- upper" Done.

Re: ICU locale validation / canonicalization

2023-03-31 Thread Jeff Davis
On Thu, 2023-03-30 at 08:59 +0200, Peter Eisentraut wrote: > I don't think we should show the notice when the canonicalization > doesn't change anything.  This is not useful: > > +NOTICE:  using language tag "und-u-kf-upper" for locale "und-u-kf- > upper" Done. > Also, the message should be

Re: ICU locale validation / canonicalization

2023-03-30 Thread Jeff Davis
On Thu, 2023-03-30 at 08:59 +0200, Peter Eisentraut wrote: > I don't think the special handling of IsBinaryUpgrade is needed or > wanted.  I would hope that with this feature, all old-style locale > IDs > would go away, but this way we would keep them forever.  If we > believe > that

Re: ICU locale validation / canonicalization

2023-03-30 Thread Peter Eisentraut
On 30.03.23 04:33, Jeff Davis wrote: Attached is a new version of the final patch, which performs canonicalization. I'm not 100% sure that it's wanted, but it still seems like a good idea to get the locales into a standard format in the catalogs, and if a lot more people start using ICU in v16

Re: ICU locale validation / canonicalization

2023-03-29 Thread Jeff Davis
On Tue, 2023-03-28 at 08:41 +0200, Peter Eisentraut wrote: > [PATCH v9 1/5] Fix error inconsistency in older ICU versions. > > ok Committed 0001. > [PATCH v9 2/5] initdb: replace check_icu_locale() with >   default_icu_locale(). > > I would keep the #ifdef USE_ICU inside the lower-level

Re: ICU locale validation / canonicalization

2023-03-28 Thread Peter Eisentraut
[PATCH v9 1/5] Fix error inconsistency in older ICU versions. ok [PATCH v9 2/5] initdb: replace check_icu_locale() with default_icu_locale(). I would keep the #ifdef USE_ICU inside the lower-level function default_icu_locale(), like it was before, so that the higher-level setlocales()

Re: ICU locale validation / canonicalization

2023-03-24 Thread Jeff Davis
On Fri, 2023-03-24 at 10:10 +0100, Peter Eisentraut wrote: > Couldn't we do this in a simpler way by just freeing the collator > before > the ereport() calls. I committed a tiny patch to do this. We still need to address the error inconsistency though. The problem is that, in older ICU

Re: ICU locale validation / canonicalization

2023-03-24 Thread Peter Eisentraut
On 24.03.23 07:39, Jeff Davis wrote: On Thu, 2023-03-23 at 10:16 -0700, Jeff Davis wrote: I could get rid of the SQL-callable function and move the rest of the changes into 0006. I'll see if that arrangement works better, and that way we can add the SQL-callable function later (or perhaps not

Re: ICU locale validation / canonicalization

2023-03-24 Thread Peter Eisentraut
On 23.03.23 18:16, Jeff Davis wrote: In 0002, the error "opening default collator is not supported", should that be an assert or an elog?  Is it reachable by the user? It's not reachable by the user, but could catch a bug if we accidentally read a NULL field from the catalog or something like

Re: ICU locale validation / canonicalization

2023-03-24 Thread Jeff Davis
On Thu, 2023-03-23 at 10:16 -0700, Jeff Davis wrote: > I could get rid of the SQL-callable function and move the rest of the > changes into 0006. I'll see if that arrangement works better, and > that > way we can add the SQL-callable function later (or perhaps not at all > if it's not desired).

Re: ICU locale validation / canonicalization

2023-03-23 Thread Jeff Davis
On Thu, 2023-03-23 at 07:27 +0100, Peter Eisentraut wrote: > So, does uloc_canonicalize() always convert to ICU locale IDs?  What > if > you pass a language tag, does it convert it to ICU locale ID as well? Yes. The documentation is not clear on that point, but my testing shows that it does.

Re: ICU locale validation / canonicalization

2023-03-23 Thread Peter Eisentraut
On 22.03.23 19:05, Jeff Davis wrote: On Tue, 2023-03-21 at 10:35 +0100, Peter Eisentraut wrote: [PATCH v6 1/6] Support language tags in older ICU versions (53 and   earlier). In pg_import_system_collations(), this is now redundant and can be simplified: -   if

Re: ICU locale validation / canonicalization

2023-03-22 Thread Jeff Davis
On Tue, 2023-03-21 at 10:35 +0100, Peter Eisentraut wrote: > [PATCH v6 1/6] Support language tags in older ICU versions (53 and >   earlier). > > In pg_import_system_collations(), this is now redundant and can be > simplified: > > -   if (!pg_is_ascii(langtag) ||

Re: ICU locale validation / canonicalization

2023-03-21 Thread Peter Eisentraut
On 17.03.23 18:55, Jeff Davis wrote: On Wed, 2023-03-15 at 15:18 -0700, Jeff Davis wrote: I left out the validation patch for now, and I'm evaluating a different approach that will attempt to match to the locales retrieved with uloc_countAvailable()/uloc_getAvailable(). I like this approach,

Re: ICU locale validation / canonicalization

2023-03-17 Thread Jeff Davis
On Wed, 2023-03-15 at 15:18 -0700, Jeff Davis wrote: > I left out the validation patch for now, and I'm evaluating a > different > approach that will attempt to match to the locales retrieved with > uloc_countAvailable()/uloc_getAvailable(). I like this approach, attached new patch series with

Re: ICU locale validation / canonicalization

2023-03-15 Thread Jeff Davis
On Tue, 2023-03-14 at 23:47 -0700, Jeff Davis wrote: > On Tue, 2023-03-14 at 10:10 -0700, Jeff Davis wrote: > > One loose end is that we really should support language tags like > > "und" > > in those older versions (54 and earlier). Your commit d72900bded > > avoided the problem, but perhaps we

Re: ICU locale validation / canonicalization

2023-03-15 Thread Jeff Davis
On Tue, 2023-03-14 at 10:10 -0700, Jeff Davis wrote: > One loose end is that we really should support language tags like > "und" > in those older versions (54 and earlier). Your commit d72900bded > avoided the problem, but perhaps we should fix it by looking for > "und" > and replacing it with

Re: ICU locale validation / canonicalization

2023-03-14 Thread Jeff Davis
On Tue, 2023-03-14 at 08:08 +0100, Peter Eisentraut wrote: > Another issue that came to mind:  Right now, you can, say, develop > SQL > schemas on a newer ICU version, say, your laptop, and then deploy > them > on a server running an older ICU version.  If we have a cutoff beyond > which we

Re: ICU locale validation / canonicalization

2023-03-14 Thread Peter Eisentraut
On 13.03.23 16:31, Jeff Davis wrote: What we had discussed a while ago in one of these threads is that ICU before version 54 do not support keyword lists, and we have custom code to do that parsing ourselves, but we don't have code to do the same for language tags.  Therefore, if I understand

Re: ICU locale validation / canonicalization

2023-03-13 Thread Jeff Davis
On Mon, 2023-03-13 at 08:25 +0100, Peter Eisentraut wrote: > For clarification, I wasn't complaining about the notice, but about > the > automatic conversion from old-style ICU locale ID to language tag. Canonicalization means that we pick one format, and automatically convert to it, right? >

Re: ICU locale validation / canonicalization

2023-03-13 Thread Peter Eisentraut
On 09.03.23 21:17, Jeff Davis wrote: Personally, I'm not on board with this behavior: => CREATE COLLATION test (provider = icu, locale = 'de@collation=phonebook'); NOTICE:  0: using language tag "de-u-co-phonebk" for locale "de@collation=phonebook" I mean, maybe that is a thing we want to

Re: ICU locale validation / canonicalization

2023-03-09 Thread Jeff Davis
On Thu, 2023-03-09 at 09:46 +0100, Peter Eisentraut wrote: > This patch appears to do about three things at once, and it's not > clear > exactly where the boundaries are between them and which ones we might > actually want.  And I think the terminology also gets mixed up a bit, > which makes

Re: ICU locale validation / canonicalization

2023-03-09 Thread Peter Eisentraut
On 28.02.23 06:57, Jeff Davis wrote: On Mon, 2023-02-20 at 15:23 -0800, Jeff Davis wrote: New patch attached. The new patch also includes a GUC that (when enabled) validates that the collator is actually found. New patch attached. Now it always preserves the exact locale string during

Re: ICU locale validation / canonicalization

2023-02-27 Thread Jeff Davis
On Mon, 2023-02-20 at 15:23 -0800, Jeff Davis wrote: > > New patch attached. The new patch also includes a GUC that (when > enabled) validates that the collator is actually found. New patch attached. Now it always preserves the exact locale string during pg_upgrade, and does not attempt to

Re: ICU locale validation / canonicalization

2023-02-20 Thread Jeff Davis
New patch attached. The new patch also includes a GUC that (when enabled) validates that the collator is actually found. On Mon, 2023-02-20 at 15:46 +0100, Peter Eisentraut wrote: > a) BCP47 tags are preferred, and Agreed. > b) They don't work with ICU versions before 54. I tried in versions

Re: ICU locale validation / canonicalization

2023-02-20 Thread Peter Eisentraut
On 10.02.23 18:53, Jeff Davis wrote: To represent ICU locale strings in the catalog consistently, we have two choices, which as far as I can tell are equivalent: 1. ICU format Locale IDs. These are more readable, and still specified (albeit non-standard). 2. BCP47 language tags. These are

Re: ICU locale validation / canonicalization

2023-02-16 Thread Jeff Davis
On Thu, 2023-02-09 at 14:09 -0800, Jeff Davis wrote: > It feels like BCP 47 is the right catalog representation. We are > already using it for the import of initial collations, and it's a > standard, and there seems to be good support in ICU. Patch attached. We should have been canonicalizing

Re: ICU locale validation / canonicalization

2023-02-13 Thread Jeff Davis
On Fri, 2023-02-10 at 22:50 -0500, Robert Haas wrote: > The fact that you're figuring out how it all works from reading the > source code does not give me a warm feeling. Right. On the other hand, the behavior is quite well documented, it was just the keyword that was undocumented (or I didn't

Re: ICU locale validation / canonicalization

2023-02-10 Thread Robert Haas
On Fri, Feb 10, 2023 at 12:54 PM Jeff Davis wrote: > In our tests you can see colstrength=primary is used to mean "case > insensitive". That's where I picked up the "colstrength" keyword, which > is also present in the ICU sources, but now that you ask I'm embarassed > that I don't see the

Re: ICU locale validation / canonicalization

2023-02-10 Thread Jeff Davis
On Fri, 2023-02-10 at 09:43 -0500, Robert Haas wrote: > On Thu, Feb 9, 2023 at 5:09 PM Jeff Davis wrote: > > I do like the ICU format locale IDs from a readability standpoint. > > "en_US@colstrength=primary" is more meaningful to me than "en-US-u- > > ks- > > level1" (the equivalent language

Re: ICU locale validation / canonicalization

2023-02-10 Thread Jeff Davis
On Fri, 2023-02-10 at 07:42 +0100, Peter Eisentraut wrote: > It turns out that 'de_AT' is actually a distinct collation from 'de' > in > CLDR, so that was not the best example.  What behavior do you see for > 'de_CH'? The canonicalized form is de_CH and the bcp47 tag is de-CH.

Re: ICU locale validation / canonicalization

2023-02-10 Thread Robert Haas
On Thu, Feb 9, 2023 at 5:09 PM Jeff Davis wrote: > I do like the ICU format locale IDs from a readability standpoint. > "en_US@colstrength=primary" is more meaningful to me than "en-US-u-ks- > level1" (the equivalent language tag). Sadly, neither of those means a whole lot to me? :-( How did

Re: ICU locale validation / canonicalization

2023-02-09 Thread Peter Eisentraut
On 09.02.23 22:15, Jeff Davis wrote: On Thu, 2023-02-09 at 15:44 +0100, Peter Eisentraut wrote: One use case is that if a user specifies a locale, say, of 'de-AT', this might canonicalize to 'de' today, Canonicalization should not lose useful information, it should just rearrange it, so I

Re: ICU locale validation / canonicalization

2023-02-09 Thread Andreas Karlsson
On 2/10/23 02:22, Jeff Davis wrote: We will still allow the ICU format locale IDs for input; we would just convert them to BCP47 before storing them in the catalog. And there's an inverse function, so it's easy enough to offer a view that shows the ICU format locale IDs in addition to the BCP 47

Re: ICU locale validation / canonicalization

2023-02-09 Thread Jeff Davis
On Fri, 2023-02-10 at 01:04 +0100, Andreas Karlsson wrote: > I have the same feeling one is readable and the other unreadable but > the > unreadable one is standardized. Hard call. > > And in general I agree, if we are going to make ICU default it needs > to > be more user friendly than it is

Re: ICU locale validation / canonicalization

2023-02-09 Thread Andreas Karlsson
On 2/9/23 23:09, Jeff Davis wrote: I do like the ICU format locale IDs from a readability standpoint. "en_US@colstrength=primary" is more meaningful to me than "en-US-u-ks- level1" (the equivalent language tag). And the format is specified[1], even though it's not an independent standard. But I

Re: ICU locale validation / canonicalization

2023-02-09 Thread Jeff Davis
On Thu, 2023-02-09 at 10:53 -0500, Robert Haas wrote: > Unfortunately, I have no idea whether your specific ideas about how > to > make that happen are any good or not. But I hope they are, because > the > current situation is pessimal. It feels like BCP 47 is the right catalog representation. We

Re: ICU locale validation / canonicalization

2023-02-09 Thread Jeff Davis
On Thu, 2023-02-09 at 15:44 +0100, Peter Eisentraut wrote: > One use case is that if a user specifies a locale, say, of 'de-AT', > this > might canonicalize to 'de' today, Canonicalization should not lose useful information, it should just rearrange it, so I don't see a risk here based on what I

Re: ICU locale validation / canonicalization

2023-02-09 Thread Robert Haas
On Wed, Feb 8, 2023 at 2:59 AM Jeff Davis wrote: > We do check that the value is accepted by ICU, but ICU seems to accept > anything and use some fallback logic. Bogus strings will typically end > up as the "root" locale (spelled "root" or ""). I've noticed this, and I think it's really

Re: ICU locale validation / canonicalization

2023-02-09 Thread Peter Eisentraut
On 08.02.23 08:59, Jeff Davis wrote: The overall benefit here is that we keep our catalogs consistently using an independent standard format for ICU locale strings, rather than whatever the user specifies. That makes it less likely that ICU needs to use any fallback logic when trying to open a

ICU locale validation / canonicalization

2023-02-07 Thread Jeff Davis
Right now, ICU locales are not validated: initdb ... --locale-provider=icu --icu-locale=anything CREATE COLLATION foo (PROVIDER=icu, LOCALE='anything'); CREATE DATABASE anythingdb ICU_LOCALE 'anything'; all succeed. We do check that the value is accepted by ICU, but ICU seems to accept