On Tue, 2023-03-21 at 10:35 +0100, Peter Eisentraut wrote:
> [PATCH v6 1/6] Support language tags in older ICU versions (53 and
> earlier).
>
> In pg_import_system_collations(), this is now redundant and can be
> simplified:
>
> -   if (!pg_is_ascii(langtag) || !pg_is_ascii(iculocstr))
> +   if (!pg_is_ascii(langtag) || !pg_is_ascii(langtag))
>
> icu_set_collation_attributes() needs more commenting about what is
> going on.  My guess is that uloc_canonicalize() converts from language
> tag to ICU locale ID, and then the existing logic to parse that apart
> would apply.  Is that how it works?

Fixed the redundancy, added some comments, and committed 0001.

> [PATCH v6 2/6] Wrap ICU ucol_open().
>
> It makes sense to try to unify some of this.  But I find the naming
> confusing.  If I see pg_ucol_open(), then I would expect that all
> calls to ucol_open() would be replaced by this.  But here it's only a
> few, without explanation.  (pg_ucol_open() has no explanation at all
> AFAICT.)

The remaining callsite which doesn't use the wrapper is in initdb.c,
which can't call into pg_locale.c, and has different intentions. initdb
uses ucol_open() to get the default locale if icu_locale is not
specified; and it also uses ucol_open() to verify that the locale can
be opened (whether specified or the default). (Aside: I created a tiny
0005 patch which makes this difference more clear and adds a nice
comment.)

There's no reason to use a wrapper when getting the default locale,
because it's just passing NULL anyway.

When verifying that the locale can be opened, ucol_open() doesn't catch
many problems anyway, so I'm not sure it's worth a lot of effort to
copy these extra checks that the wrapper does into initdb.c. For
instance, what's the value in replacing "und" with "root" if opening
either will succeed? Parsing the attributes can potentially catch
problems, but the later patch 0006 will check the attributes when
converting to a language tag at initdb time.

So I'm inclined to just leave initdb alone in patches 0002 and 0003.

> I have in my notes that check_icu_locale() and make_icu_collator()
> should be combined into a single function.  I think that would be a
> better way to slice it.

That would leave out get_collation_actual_version(), which should
handle the same fixups for attributes and the "und" locale.

> Btw., I had intentionally not written code like this
>
> +#if U_ICU_VERSION_MAJOR_NUM < 54
> +   icu_set_collation_attributes(collator, loc_str);
> +#endif
>
> The disadvantage of doing it that way is that you then need to dig
> out an old version of ICU in order to check whether the code compiles
> at all.  With the current code, you can be sure that that code
> compiles if you make changes elsewhere.

I was wondering about that -- thank you, I changed it back to use "if"
rather than "#ifdef".

New series attached (starting at 0002 to better correspond to the
previous series).

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

From fbe03dc596b5e12f4dda60269e044caa58f8be32 Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Tue, 14 Mar 2023 21:21:17 -0700
Subject: [PATCH v7 2/7] Wrap ICU ucol_open().

Hide details of supporting older ICU versions in a wrapper
function. The current code only needs to handle
icu_set_collation_attributes(), but a subsequent commit will add
additional version-specific code.
---
 src/backend/utils/adt/pg_locale.c | 70 +++++++++++++++++++------------
 1 file changed, 43 insertions(+), 27 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index c3ede994be..dd0786dff5 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -140,6 +140,7 @@ static char *IsoLocaleName(const char *);
  */
 static UConverter *icu_converter = NULL;
 
+static UCollator *pg_ucol_open(const char *loc_str);
 static void init_icu_converter(void);
 static size_t uchar_length(UConverter *converter,
                            const char *str, int32_t len);
@@ -1430,17 +1431,8 @@ make_icu_collator(const char *iculocstr,
 {
 #ifdef USE_ICU
     UCollator  *collator;
-    UErrorCode  status;
 
-    status = U_ZERO_ERROR;
-    collator = ucol_open(iculocstr, &status);
-    if (U_FAILURE(status))
-        ereport(ERROR,
-                (errmsg("could not open collator for locale \"%s\": %s",
-                        iculocstr, u_errorName(status))));
-
-    if (U_ICU_VERSION_MAJOR_NUM < 54)
-        icu_set_collation_attributes(collator, iculocstr);
+    collator = pg_ucol_open(iculocstr);
 
     /*
      * If rules are specified, we extract the rules of the standard collation,
@@ -1451,6 +1443,7 @@ make_icu_collator(const char *iculocstr,
         const UChar *default_rules;
         UChar      *agg_rules;
         UChar      *my_rules;
+        UErrorCode  status;
         int32_t     length;
 
         default_rules = ucol_getRules(collator, &length);
@@ -1722,16 +1715,11 @@ get_collation_actual_version(char collprovider, const char *collcollate)
     if (collprovider == COLLPROVIDER_ICU)
     {
         UCollator  *collator;
-        UErrorCode  status;
         UVersionInfo versioninfo;
         char        buf[U_MAX_VERSION_STRING_LENGTH];
 
-        status = U_ZERO_ERROR;
-        collator = ucol_open(collcollate, &status);
-        if (U_FAILURE(status))
-            ereport(ERROR,
-                    (errmsg("could not open collator for locale \"%s\": %s",
-                            collcollate, u_errorName(status))));
+        collator = pg_ucol_open(collcollate);
+
         ucol_getVersion(collator, versioninfo);
         ucol_close(collator);
 
@@ -2505,6 +2493,43 @@ pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 }
 
 #ifdef USE_ICU
+
+/*
+ * Wrapper around ucol_open() to handle API differences for older ICU
+ * versions.
+ */
+static UCollator *
+pg_ucol_open(const char *loc_str)
+{
+    UCollator  *collator;
+    UErrorCode  status;
+
+    /*
+     * Must never open default collator, because it depends on the environment
+     * and may change at any time.
+     *
+     * NB: the default collator is not the same as the collator for the root
+     * locale. The root locale may be specified as the empty string, "und", or
+     * "root". The default collator is opened by passing NULL to ucol_open().
+     */
+    if (loc_str == NULL)
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("opening default collator is not supported")));
+
+    status = U_ZERO_ERROR;
+    collator = ucol_open(loc_str, &status);
+    if (U_FAILURE(status))
+        ereport(ERROR,
+                (errmsg("could not open collator for locale \"%s\": %s",
+                        loc_str, u_errorName(status))));
+
+    if (U_ICU_VERSION_MAJOR_NUM < 54)
+        icu_set_collation_attributes(collator, loc_str);
+
+    return collator;
+}
+
 static void
 init_icu_converter(void)
 {
@@ -2771,17 +2796,8 @@ check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
     UCollator  *collator;
-    UErrorCode  status;
-
-    status = U_ZERO_ERROR;
-    collator = ucol_open(icu_locale, &status);
-    if (U_FAILURE(status))
-        ereport(ERROR,
-                (errmsg("could not open collator for locale \"%s\": %s",
-                        icu_locale, u_errorName(status))));
 
-    if (U_ICU_VERSION_MAJOR_NUM < 54)
-        icu_set_collation_attributes(collator, icu_locale);
+    collator = pg_ucol_open(icu_locale);
     ucol_close(collator);
 #else
     ereport(ERROR,
-- 
2.34.1

From 8027572146571609927815d0fe14f761fc86cf2c Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Tue, 14 Mar 2023 22:28:21 -0700
Subject: [PATCH v7 3/7] Handle the "und" locale in ICU versions 54 and older.

The "und" locale is an alternative spelling of the root locale, but it
was not recognized until ICU 55. To maintain common behavior across
all supported ICU versions, check for "und" and replace with "root"
before opening.

Previously, the lack of support for "und" was dangerous, because
versions 54 and older fall back to the environment when a locale is
not found. If the user specified "und" for the language (which is
expected and documented), it could not only resolve to the wrong
collator, but it could unexpectedly change (which could lead to
corrupt indexes).

This effectively reverts commit d72900bded, which worked around the
problem for the built-in "unicode" collation, and is no longer
necessary.

Discussion: https://postgr.es/m/60da0cecfb512a78b8666b31631a636215d8ce73.ca...@j-davis.com
Discussion: https://postgr.es/m/0c6fa66f2753217d2a40480a96bd2ccf023536a1.ca...@j-davis.com
---
 src/backend/utils/adt/pg_locale.c             | 38 ++++++++++++++++++-
 src/bin/initdb/initdb.c                       |  2 +-
 .../regress/expected/collate.icu.utf8.out     |  7 ++++
 src/test/regress/sql/collate.icu.utf8.sql     |  2 +
 4 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index dd0786dff5..052db11413 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2501,8 +2501,9 @@ pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 static UCollator *
 pg_ucol_open(const char *loc_str)
 {
-    UCollator  *collator;
-    UErrorCode  status;
+    UCollator  *collator;
+    UErrorCode  status;
+    char       *fixed_str = NULL;
 
     /*
      * Must never open default collator, because it depends on the environment
@@ -2517,6 +2518,36 @@ pg_ucol_open(const char *loc_str)
                 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                  errmsg("opening default collator is not supported")));
 
+    /*
+     * In ICU versions 54 and earlier, "und" is not a recognized spelling of
+     * the root locale. If the first component of the locale is "und", replace
+     * with "root" before opening.
+     */
+    if (U_ICU_VERSION_MAJOR_NUM < 55)
+    {
+        char        lang[ULOC_LANG_CAPACITY];
+
+        status = U_ZERO_ERROR;
+        uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+        if (U_FAILURE(status))
+        {
+            ereport(ERROR,
+                    (errmsg("could not get language from locale \"%s\": %s",
+                            loc_str, u_errorName(status))));
+        }
+
+        if (strcmp(lang, "und") == 0)
+        {
+            const char *remainder = loc_str + strlen("und");
+
+            fixed_str = palloc(strlen("root") + strlen(remainder) + 1);
+            strcpy(fixed_str, "root");
+            strcat(fixed_str, remainder);
+
+            loc_str = fixed_str;
+        }
+    }
+
     status = U_ZERO_ERROR;
     collator = ucol_open(loc_str, &status);
     if (U_FAILURE(status))
@@ -2527,6 +2558,9 @@ pg_ucol_open(const char *loc_str)
     if (U_ICU_VERSION_MAJOR_NUM < 54)
         icu_set_collation_attributes(collator, loc_str);
 
+    if (fixed_str != NULL)
+        pfree(fixed_str);
+
     return collator;
 }
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 68d430ed63..d48b7b6060 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1498,7 +1498,7 @@ setup_collation(FILE *cmdfd)
      * that they win if libc defines a locale with the same name.
      */
     PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, collowner, collprovider, collisdeterministic, collencoding, colliculocale)"
-                  "VALUES (pg_nextoid('pg_catalog.pg_collation', 'oid', 'pg_catalog.pg_collation_oid_index'), 'unicode', 'pg_catalog'::regnamespace, %u, '%c', true, -1, '');\n\n",
+                  "VALUES (pg_nextoid('pg_catalog.pg_collation', 'oid', 'pg_catalog.pg_collation_oid_index'), 'unicode', 'pg_catalog'::regnamespace, %u, '%c', true, -1, 'und');\n\n",
                   BOOTSTRAP_SUPERUSERID, COLLPROVIDER_ICU);
 
     PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, collowner, collprovider, collisdeterministic, collencoding, collcollate, collctype)"
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 6225b575ce..f135200c99 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1312,6 +1312,13 @@ SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
  t
 (1 row)
 
+CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
 CREATE TABLE test3cs (x text COLLATE case_sensitive);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 64cbfd0a5b..8105ebc8ae 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -521,6 +521,8 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
 -- test language tags
 CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
 SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
+CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
 
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
-- 
2.34.1

From 537a53c0504b655a3d91156bfff3d5effcf21e06 Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Wed, 15 Mar 2023 11:27:12 -0700
Subject: [PATCH v7 4/7] Accept C/POSIX locales when converting to language tag.

Account for locales "C" and "POSIX", which correspond to the language
tag "en-US-u-va-posix".

Add a SQL function pg_language_tag() that performs the conversion.

Also, don't rely on a fixed-size buffer for language tags, as there is
no defined upper limit (cf. RFC 5646 section 4.4).
---
 doc/src/sgml/func.sgml                        | 15 ++++
 src/backend/commands/collationcmds.c          | 44 ++++++------
 src/backend/utils/adt/pg_locale.c             | 68 +++++++++++++++++++
 src/bin/pg_dump/t/002_pg_dump.pl              |  4 +-
 src/include/catalog/catversion.h              |  2 +-
 src/include/catalog/pg_proc.dat               |  5 ++
 src/include/utils/pg_locale.h                 |  1 +
 .../regress/expected/collate.icu.utf8.out     | 55 +++++++++++++++
 src/test/regress/sql/collate.icu.utf8.sql     | 10 +++
 9 files changed, 180 insertions(+), 24 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index a3a13b895f..35cecc24c8 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27530,6 +27530,21 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
         Use of this function is restricted to superusers.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_language_tag</primary>
+        </indexterm>
+        <function>pg_language_tag</function> ( <parameter>locale</parameter> <type>text</type> )
+        <returnvalue>text</returnvalue>
+       </para>
+       <para>
+        Canonicalizes the given <parameter>locale</parameter> string into a
+        BCP 47 language tag (see <xref
+        linkend="collation-managing-create-icu"/>).
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 3d0aea0568..ca1d46669f 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -576,26 +576,6 @@ cmpaliases(const void *a, const void *b)
 
 
 #ifdef USE_ICU
-/*
- * Get the ICU language tag for a locale name.
- * The result is a palloc'd string.
- */
-static char *
-get_icu_language_tag(const char *localename)
-{
-    char        buf[ULOC_FULLNAME_CAPACITY];
-    UErrorCode  status;
-
-    status = U_ZERO_ERROR;
-    uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
-    if (U_FAILURE(status))
-        ereport(ERROR,
-                (errmsg("could not convert locale name \"%s\" to language tag: %s",
-                        localename, u_errorName(status))));
-
-    return pstrdup(buf);
-}
-
 /*
  * Get a comment (specifically, the display name) for an ICU locale.
  * The result is a palloc'd string, or NULL if we can't get a comment
@@ -957,7 +937,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
             else
                 name = uloc_getAvailable(i);
 
-            langtag = get_icu_language_tag(name);
+            langtag = icu_language_tag(name, false);
 
             /*
              * Be paranoid about not allowing any non-ASCII strings into
@@ -1014,3 +994,25 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 
     PG_RETURN_INT32(ncreated);
 }
+
+/*
+ * pg_language_tag
+ *
+ * Return the BCP47 language tag representation of the given locale string.
+ */
+Datum
+pg_language_tag(PG_FUNCTION_ARGS)
+{
+#ifdef USE_ICU
+    text       *locale_text = PG_GETARG_TEXT_PP(0);
+    char       *locale_cstr = text_to_cstring(locale_text);
+    char       *langtag = icu_language_tag(locale_cstr, false);
+
+    PG_RETURN_TEXT_P(cstring_to_text(langtag));
+#else
+    ereport(ERROR,
+            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+             errmsg("ICU is not supported in this build")));
+    PG_RETURN_NULL();
+#endif
+}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 052db11413..baafc71a3d 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2840,6 +2840,74 @@ check_icu_locale(const char *icu_locale)
 #endif
 }
 
+#ifdef USE_ICU
+/*
+ * Return the BCP47 language tag representation of the requested locale.
+ *
+ * This function should be called before passing the string to ucol_open(),
+ * because conversion to a language tag also performs "level 2
+ * canonicalization". In addition to producing a consistent format, level 2
+ * canonicalization is able to more accurately interpret different input
+ * locale string formats, such as POSIX and .NET IDs.
+ */
+char *
+icu_language_tag(const char *loc_str, bool noError)
+{
+    UErrorCode  status;
+    char        lang[ULOC_LANG_CAPACITY];
+    char       *langtag;
+    size_t      buflen = 32;    /* arbitrary starting buffer size */
+    const bool  strict = true;
+
+    status = U_ZERO_ERROR;
+    uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+    if (U_FAILURE(status))
+    {
+        ereport(ERROR,
+                (errmsg("could not get language from locale \"%s\": %s",
+                        loc_str, u_errorName(status))));
+    }
+
+    /* C/POSIX locales aren't handled by uloc_toLanguageTag() */
+    if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+        return pstrdup("en-US-u-va-posix");
+
+    /*
+     * A BCP47 language tag doesn't have a clearly-defined upper limit
+     * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+     * uloc_toLanguageTag() doesn't always return the ultimate length on the
+     * first call, necessitating a loop.
+     */
+    langtag = palloc(buflen);
+    while (true)
+    {
+        int32_t     len;
+
+        status = U_ZERO_ERROR;
+        len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+        if (len < buflen || buflen >= MaxAllocSize)
+            break;
+
+        buflen = Min(buflen * 2, MaxAllocSize);
+        langtag = repalloc(langtag, buflen);
+    }
+
+    if (U_FAILURE(status))
+    {
+        pfree(langtag);
+        if (noError)
+            return NULL;
+
+        ereport(ERROR,
+                (errmsg("could not convert locale name \"%s\" to language tag: %s",
+                        loc_str, u_errorName(status))));
+    }
+
+    return langtag;
+}
+
+#endif                          /* USE_ICU */
+
 /*
  * These functions convert from/to libc's wchar_t, *not* pg_wchar_t.
  * Therefore we keep them here rather than with the mbutils code.
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index a22f27f300..0b38c0537b 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1837,9 +1837,9 @@ my %tests = (
 
     'CREATE COLLATION icu_collation' => {
         create_order => 76,
-        create_sql => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'C');",
+        create_sql => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'en-US-u-va-posix');",
         regexp =>
-          qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'C'(, version = '[^']*')?\);/m,
+          qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'en-US-u-va-posix'(, version = '[^']*')?\);/m,
         icu => 1,
         like => { %full_runs, section_pre_data => 1, },
     },
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index e94528a7c7..d993539dfe 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*                          yyyymmddN */
-#define CATALOG_VERSION_NO  202303181
+#define CATALOG_VERSION_NO  202303211
 
 #endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5cf87aeb2c..43db94557d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11838,6 +11838,11 @@
   proname => 'pg_database_collation_actual_version', procost => '100',
   provolatile => 'v', prorettype => 'text', proargtypes => 'oid',
   prosrc => 'pg_database_collation_actual_version' },
+{ oid => '6273',
+  descr => 'get BCP47 language tag representation of locale',
+  proname => 'pg_language_tag', procost => '100',
+  provolatile => 's', prorettype => 'text', proargtypes => 'text',
+  prosrc => 'pg_language_tag' },
 
 # system management/monitoring related functions
 { oid => '3353', descr => 'list files in the log directory',
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index dd822a68be..ae9077c9bc 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -121,6 +121,7 @@ extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+extern char *icu_language_tag(const char *loc_str, bool noError);
 #endif
 
 extern void check_icu_locale(const char *icu_locale);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index f135200c99..d8e6240cd7 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -11,6 +11,61 @@ SELECT getdatabaseencoding() <> 'UTF8' OR
 SET client_encoding TO UTF8;
 CREATE SCHEMA collate_tests;
 SET search_path = collate_tests;
+-- test language tag canonicalization
+SELECT pg_language_tag('en_US');
+ pg_language_tag 
+-----------------
+ en-US
+(1 row)
+
+SELECT pg_language_tag('nonsense');
+ pg_language_tag 
+-----------------
+ nonsense
+(1 row)
+
+SELECT pg_language_tag('C.UTF-8');
+ pg_language_tag  
+------------------
+ en-US-u-va-posix
+(1 row)
+
+SELECT pg_language_tag('POSIX');
+ pg_language_tag  
+------------------
+ en-US-u-va-posix
+(1 row)
+
+SELECT pg_language_tag('en_US_POSIX');
+ pg_language_tag  
+------------------
+ en-US-u-va-posix
+(1 row)
+
+SELECT pg_language_tag('@colStrength=secondary');
+ pg_language_tag 
+-----------------
+ und-u-ks-level2
+(1 row)
+
+SELECT pg_language_tag('');
+ pg_language_tag 
+-----------------
+ und
+(1 row)
+
+SELECT pg_language_tag('fr_CA.UTF-8');
+ pg_language_tag 
+-----------------
+ fr-CA
+(1 row)
+
+SELECT pg_language_tag('en_US@colStrength=primary');
+  pg_language_tag  
+-------------------
+ en-US-u-ks-level1
+(1 row)
+
 CREATE TABLE collate_test1 (
     a int,
     b text COLLATE "en-x-icu" NOT NULL
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 8105ebc8ae..c7241c739a 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -15,6 +15,16 @@ SET client_encoding TO UTF8;
 CREATE SCHEMA collate_tests;
 SET search_path = collate_tests;
 
+-- test language tag canonicalization
+SELECT pg_language_tag('en_US');
+SELECT pg_language_tag('nonsense');
+SELECT pg_language_tag('C.UTF-8');
+SELECT pg_language_tag('POSIX');
+SELECT pg_language_tag('en_US_POSIX');
+SELECT pg_language_tag('@colStrength=secondary');
+SELECT pg_language_tag('');
+SELECT pg_language_tag('fr_CA.UTF-8');
+SELECT pg_language_tag('en_US@colStrength=primary');
 
 CREATE TABLE collate_test1 (
     a int,
-- 
2.34.1

From c39c714fecd2fcf304e2fc8a120b9f6851cd78db Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Wed, 22 Mar 2023 10:06:23 -0700
Subject: [PATCH v7 5/7] initdb: emit message when using default ICU locale.

Also, minor cleanup to separate the code that chooses the default ICU
locale from the code that verifies that a specified locale can be
opened with ucol_open(). This cleanup creates a better place for an
important comment.
---
 src/bin/initdb/initdb.c | 77 ++++++++++++++++++++++++++++-------------
 1 file changed, 52 insertions(+), 25 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index d48b7b6060..7f857f6075 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2039,46 +2039,73 @@ check_icu_locale_encoding(int user_enc)
     return true;
 }
 
+#ifdef USE_ICU
+
 /*
- * Check that ICU accepts the locale name; or if not specified, retrieve the
- * default ICU locale.
+ * Determine default ICU locale by opening the default collator and reading
+ * its locale.
+ *
+ * NB: The default collator (opened using NULL) is different from the collator
+ * for the root locale (opened with "", "und", or "root"). The former depends
+ * on the environment (useful at initdb time) and the latter does not.
  */
-static void
-check_icu_locale(void)
+static char *
+default_icu_locale(void)
 {
-#ifdef USE_ICU
     UCollator  *collator;
     UErrorCode  status;
+    const char *valid_locale;
+    char       *default_locale;
 
     status = U_ZERO_ERROR;
-    collator = ucol_open(icu_locale, &status);
+    collator = ucol_open(NULL, &status);
+    if (U_FAILURE(status))
+        pg_fatal("could not open collator for default locale: %s",
+                 u_errorName(status));
+
+    status = U_ZERO_ERROR;
+    valid_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
+                                        &status);
     if (U_FAILURE(status))
     {
-        if (icu_locale)
-            pg_fatal("could not open collator for locale \"%s\": %s",
-                     icu_locale, u_errorName(status));
-        else
-            pg_fatal("could not open collator for default locale: %s",
-                     u_errorName(status));
+        ucol_close(collator);
+        pg_fatal("could not determine default ICU locale");
     }
 
-    /* if not specified, get locale from default collator */
-    if (icu_locale == NULL)
-    {
-        const char *default_locale;
+    default_locale = pg_strdup(valid_locale);
 
-        status = U_ZERO_ERROR;
-        default_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
-                                              &status);
-        if (U_FAILURE(status))
-        {
-            ucol_close(collator);
-            pg_fatal("could not determine default ICU locale");
-        }
+    ucol_close(collator);
 
-        icu_locale = pg_strdup(default_locale);
+    return default_locale;
+}
+
+#endif
+
+/*
+ * If not specified, assign the default locale. Then check that ICU accepts
+ * the locale.
+ */
+static void
+check_icu_locale(void)
+{
+#ifdef USE_ICU
+    UCollator  *collator;
+    UErrorCode  status;
+
+    /* acquire default locale from the environment, if not specified */
+    if (icu_locale == NULL)
+    {
+        icu_locale = default_icu_locale();
+        printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
     }
 
+    /* check that the resulting locale can be opened */
+    status = U_ZERO_ERROR;
+    collator = ucol_open(icu_locale, &status);
+    if (U_FAILURE(status))
+        pg_fatal("could not open collator for locale \"%s\": %s",
+                 icu_locale, u_errorName(status));
+
     ucol_close(collator);
 #endif
 }
-- 
2.34.1

From ba447ed36dc028e9fb1a0a63d392531223bb8ffc Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Wed, 15 Mar 2023 12:37:06 -0700
Subject: [PATCH v7 6/7] Canonicalize ICU locale names to language tags.

Convert to BCP47 language tags before storing in the catalog, except
during binary upgrade or when the locale comes from an existing
collation or template database.

Canonicalization is important, because it's able to handle more kinds
of locale strings than ucol_open(). Without canonicalizing first, a
locale string like "fr_CA.UTF-8" will be misinterpreted by
ucol_open().

The resulting language tags can vary slightly between ICU
versions. For instance, "@colBackwards=yes" is converted to
"und-u-kb-true" in older versions of ICU, and to the simpler (but
equivalent) "und-u-kb" in newer versions.

Discussion: https://postgr.es/m/8c7af6820aed94dc7bc259d2aa7f9663518e6137.ca...@j-davis.com
---
 src/backend/commands/collationcmds.c          | 38 +++++++++++
 src/backend/commands/dbcommands.c             | 32 +++++++++
 src/backend/utils/adt/pg_locale.c             |  9 ---
 src/bin/initdb/initdb.c                       | 68 ++++++++++++++++++-
 .../regress/expected/collate.icu.utf8.out     | 25 ++++++-
 src/test/regress/sql/collate.icu.utf8.sql     | 13 ++++
 6 files changed, 173 insertions(+), 12 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index ca1d46669f..bcddd1d536 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -165,6 +165,11 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
         else
             colliculocale = NULL;
 
+        /*
+         * When the ICU locale comes from an existing collation, do not
+         * canonicalize to a language tag.
+         */
+
         datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collicurules, &isnull);
         if (!isnull)
             collicurules = TextDatumGetCString(datum);
@@ -254,10 +259,43 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
     }
     else if (collprovider == COLLPROVIDER_ICU)
     {
+#ifdef USE_ICU
+        char       *langtag;
+
         if (!colliculocale)
             ereport(ERROR,
                     (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
                      errmsg("parameter \"locale\" must be specified")));
+
+        check_icu_locale(colliculocale);
+
+        /*
+         * During binary upgrade, preserve the locale string. Otherwise,
+         * canonicalize to a language tag.
+         */
+        if (!IsBinaryUpgrade)
+        {
+            langtag = icu_language_tag(colliculocale, true);
+            if (langtag)
+            {
+                ereport(NOTICE,
+                        (errmsg("using language tag \"%s\" for locale \"%s\"",
+                                langtag, colliculocale)));
+
+                colliculocale = langtag;
+            }
+            else
+            {
+                ereport(WARNING,
+                        (errmsg("could not convert locale \"%s\" to language tag",
+                                colliculocale)));
+            }
+        }
+#else
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("ICU is not supported in this build")));
+#endif
     }
 
     /*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 4d5d5d6866..5935477f44 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1043,6 +1043,9 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 
     if (dblocprovider == COLLPROVIDER_ICU)
     {
+#ifdef USE_ICU
+        char       *langtag;
+
         if (!(is_encoding_supported_by_icu(encoding)))
             ereport(ERROR,
                     (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -1059,6 +1062,35 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
                      errmsg("ICU locale must be specified")));
 
         check_icu_locale(dbiculocale);
+
+        /*
+         * During binary upgrade, or when the locale came from the template
+         * database, preserve locale string. Otherwise, canonicalize to a
+         * language tag.
+         */
+        if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
+        {
+            langtag = icu_language_tag(dbiculocale, true);
+            if (langtag)
+            {
+                ereport(NOTICE,
+                        (errmsg("using language tag \"%s\" for locale \"%s\"",
+                                langtag, dbiculocale)));
+
+                dbiculocale = langtag;
+            }
+            else
+            {
+                ereport(WARNING,
+                        (errmsg("could not convert locale \"%s\" to language tag",
+                                dbiculocale)));
+            }
+        }
+#else
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("ICU is not supported in this build")));
+#endif
     }
     else
     {
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index baafc71a3d..7a5376dc75 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2820,27 +2820,18 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
     pfree(lower_str);
 }
 
-#endif                          /* USE_ICU */
-
 /*
  * Check if the given locale ID is valid, and ereport(ERROR) if it isn't.
  */
 void
 check_icu_locale(const char *icu_locale)
 {
-#ifdef USE_ICU
     UCollator  *collator;
 
     collator = pg_ucol_open(icu_locale);
     ucol_close(collator);
-#else
-    ereport(ERROR,
-            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-             errmsg("ICU is not supported in this build")));
-#endif
 }
 
-#ifdef USE_ICU
 /*
  * Return the BCP47 language tag representation of the requested locale.
  *
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 7f857f6075..609bd9ea1e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2079,11 +2079,67 @@ default_icu_locale(void)
     return default_locale;
 }
 
+/*
+ * Convert to canonical BCP47 language tag. Must be consistent with
+ * icu_language_tag().
+ */
+static char *
+icu_language_tag(const char *loc_str)
+{
+    UErrorCode  status;
+    char        lang[ULOC_LANG_CAPACITY];
+    char       *langtag;
+    size_t      buflen = 32;    /* arbitrary starting buffer size */
+    const bool  strict = true;
+
+    status = U_ZERO_ERROR;
+    uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+    if (U_FAILURE(status))
+    {
+        pg_fatal("could not get language from locale \"%s\": %s",
+                 loc_str, u_errorName(status));
+    }
+
+    /* C/POSIX locales aren't handled by uloc_toLanguageTag() */
+    if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+        return pstrdup("en-US-u-va-posix");
+
+    /*
+     * A BCP47 language tag doesn't have a clearly-defined upper limit
+     * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+     * uloc_toLanguageTag() doesn't always return the ultimate length on the
+     * first call, necessitating a loop.
+     */
+    langtag = pg_malloc(buflen);
+    while (true)
+    {
+        int32_t     len;
+
+        status = U_ZERO_ERROR;
+        len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+        if (len < buflen)
+            break;
+
+        buflen = buflen * 2;
+        langtag = pg_realloc(langtag, buflen);
+    }
+
+    if (U_FAILURE(status))
+    {
+        pg_free(langtag);
+
+        pg_fatal("could not convert locale name \"%s\" to language tag: %s",
+                 loc_str, u_errorName(status));
+    }
+
+    return langtag;
+}
+
 #endif
 
 /*
- * If not specified, assign the default locale. Then check that ICU accepts
- * the locale.
+ * If not specified, assign the default locale. Then convert to a language
+ * tag, and check that ICU accepts it.
  */
 static void
 check_icu_locale(void)
@@ -2091,6 +2147,7 @@ check_icu_locale(void)
 #ifdef USE_ICU
     UCollator  *collator;
     UErrorCode  status;
+    char       *langtag;
 
     /* acquire default locale from the environment, if not specified */
     if (icu_locale == NULL)
@@ -2099,6 +2156,13 @@ check_icu_locale(void)
         printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
     }
 
+    /* canonicalize to a language tag */
+    langtag = icu_language_tag(icu_locale);
+    printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
+           langtag, icu_locale);
+    pg_free(icu_locale);
+    icu_locale = langtag;
+
     /* check that the resulting locale can be opened */
     status = U_ZERO_ERROR;
     collator = ucol_open(icu_locale, &status);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index d8e6240cd7..730decc4cb 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1074,6 +1074,7 @@ reset enable_seqscan;
 CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -1088,9 +1089,11 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
 CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+NOTICE:  using language tag "nonsense" for locale "nonsense"
 CREATE COLLATION test4 FROM nonsense;
 ERROR:  collation "nonsense" for encoding "UTF8" does not exist
 CREATE COLLATION test5 FROM test0;
@@ -1217,14 +1220,18 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test ICU collation customization
 -- test the attributes handled by icu_set_collation_attributes()
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
  ?column? | ?column? 
 ----------+----------
@@ -1232,7 +1239,9 @@ SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll
 (1 row)
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
+NOTICE:  using language tag "und-u-kf-lower" for locale "@colCaseFirst=lower"
 CREATE COLLATION testcoll_upper_first (provider = icu, locale = '@colCaseFirst=upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "@colCaseFirst=upper"
 SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcoll_upper_first;
  ?column? | ?column? 
 ----------+----------
@@ -1240,13 +1249,16 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 (1 row)
 
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
+NOTICE:  using language tag "und-u-ka-shifted" for locale "@colAlternate=shifted"
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
  ?column? | ?column? 
----------+---------- @@ -1258,6 +1270,7 @@ ERROR: could not open collator for locale "@colNumeric=lower": U_ILLEGAL_ARGUME -- test that attributes not handled by icu_set_collation_attributes() -- (handled by ucol_open() directly) also work CREATE COLLATION testcoll_de_phonebook (provider = icu, locale = 'de@collation=phonebook'); +NOTICE: using language tag "de-u-co-phonebk" for locale "de@collation=phonebook" SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE testcoll_de_phonebook; ?column? | ?column? ----------+---------- @@ -1266,6 +1279,7 @@ SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE tes -- rules CREATE COLLATION testcoll_rules1 (provider = icu, locale = '', rules = '&a < g'); +NOTICE: using language tag "und" for locale "" CREATE TABLE test7 (a text); -- example from https://unicode-org.github.io/icu/userguide/collation/customization/#syntax INSERT INTO test7 VALUES ('Abernathy'), ('apple'), ('bird'), ('Boston'), ('Graham'), ('green'); @@ -1293,10 +1307,13 @@ SELECT * FROM test7 ORDER BY a COLLATE testcoll_rules1; DROP TABLE test7; CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!!'); -ERROR: could not open collator for locale "" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR +NOTICE: using language tag "und" for locale "" +ERROR: could not open collator for locale "und" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR -- nondeterministic collations CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true); +NOTICE: using language tag "und" for locale "" CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false); +NOTICE: using language tag "und" for locale "" CREATE TABLE test6 (a int, b text); -- same string in different normal forms INSERT INTO test6 VALUES (1, U&'\00E4bc'); @@ -1346,7 +1363,9 @@ SELECT * FROM test6a WHERE b = ARRAY['äbc'] COLLATE ctest_nondet; (2 rows) CREATE COLLATION case_sensitive (provider = icu, locale = 
''); +NOTICE: using language tag "und" for locale "" CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=secondary', deterministic = false); +NOTICE: using language tag "und-u-ks-level2" for locale "@colStrength=secondary" SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive; ?column? | ?column? ----------+---------- @@ -1361,6 +1380,7 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse -- test language tags CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false); +NOTICE: using language tag "en-u-ks-level1" for locale "en-u-ks-level1" SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive; ?column? ---------- @@ -1368,6 +1388,7 @@ SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive; (1 row) CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper'); +NOTICE: using language tag "und-u-kf-upper" for locale "und-u-kf-upper" SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst; ?column? 
---------- @@ -1828,7 +1849,9 @@ SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text); (2 rows) -- accents +SET client_min_messages=WARNING; CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false); +RESET client_min_messages; CREATE TABLE test4 (a int, b text); INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté'); SELECT * FROM test4 WHERE b = 'cote'; diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql index c7241c739a..5f3a88a404 100644 --- a/src/test/regress/sql/collate.icu.utf8.sql +++ b/src/test/regress/sql/collate.icu.utf8.sql @@ -367,6 +367,8 @@ CREATE ROLE regress_test_role; CREATE SCHEMA test_schema; -- We need to do this this way to cope with varying names for encodings: +SET client_min_messages TO WARNING; + do $$ BEGIN EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' || @@ -380,6 +382,9 @@ BEGIN quote_literal(current_setting('lc_collate')) || ');'; END $$; + +RESET client_min_messages; + CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale" CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */ DROP COLLATION testx; @@ -464,10 +469,14 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE; -- test the attributes handled by icu_set_collation_attributes() +SET client_min_messages=WARNING; CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes'); +RESET client_min_messages; SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents; +SET client_min_messages=WARNING; CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes'); +RESET client_min_messages; SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards; CREATE COLLATION testcoll_lower_first (provider = icu, locale = 
'@colCaseFirst=lower'); @@ -477,7 +486,9 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted'); SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted; +SET client_min_messages=WARNING; CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes'); +RESET client_min_messages; SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric; CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower'); @@ -666,7 +677,9 @@ INSERT INTO inner_text VALUES ('a', NULL); SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text); -- accents +SET client_min_messages=WARNING; CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false); +RESET client_min_messages; CREATE TABLE test4 (a int, b text); INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté'); -- 2.34.1
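For readers following the initdb hunk above: the icu_language_tag() added to initdb.c starts with a small buffer and doubles it until the converted tag fits. The loop shape can be sketched without ICU roughly as below. This is only an illustration under stated assumptions. fake_to_language_tag() is a hypothetical stand-in for uloc_toLanguageTag() (it copies its input and reports the length it needs), and convert_with_growing_buffer() is an invented name, not something in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for uloc_toLanguageTag(): "converts" src by copying
 * it. Writes at most cap bytes and returns the length the full result needs
 * (excluding the NUL). The output is NUL-terminated only when it fits.
 */
static int32_t
fake_to_language_tag(const char *src, char *dst, int32_t cap)
{
	int32_t		need = (int32_t) strlen(src);

	assert(cap > 0);
	if (need < cap)
		memcpy(dst, src, (size_t) need + 1);	/* fits, including NUL */
	else
		memcpy(dst, src, (size_t) cap);		/* truncated, no NUL */
	return need;
}

/*
 * Grow-and-retry loop: double the buffer until the result fits.
 * Returns a malloc'd string the caller must free, or NULL if out of memory.
 * *attempts reports how many conversion calls were made.
 */
static char *
convert_with_growing_buffer(const char *src, int32_t start_cap, int *attempts)
{
	int32_t		buflen = start_cap;
	char	   *buf = malloc((size_t) buflen);

	*attempts = 0;
	while (buf != NULL)
	{
		int32_t		len = fake_to_language_tag(src, buf, buflen);
		char	   *bigger;

		(*attempts)++;
		if (len < buflen)
			break;				/* result fits and is NUL-terminated */

		buflen *= 2;
		bigger = realloc(buf, (size_t) buflen);
		if (bigger == NULL)
			free(buf);			/* avoid leaking the old buffer */
		buf = bigger;
	}
	return buf;
}
```

With a starting size of 4, converting the 15-character string "und-u-ks-level2" takes three calls (buffer sizes 4, 8, 16). The patch itself starts at 32 bytes, so typical tags should fit on the first call.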
From 9c2929ef42f9d3c53eaf00c4390b087594475740 Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Fri, 17 Mar 2023 09:55:31 -0700
Subject: [PATCH v7 7/7] Validate ICU locales.

Ensure that it can be transformed into a language tag in "strict" mode
(which validates the attributes), and also that the language exists in
ICU.

Basic validation helps avoid minor mistakes and misspellings, which
often fall back to the root locale instead of the intended
locale. It's even more important in ICU versions 54 and earlier, where
the same (misspelled) locale string could fall back to different
locales depending on the environment.

Discussion: https://postgr.es/m/11b1eeb7e7667fdd4178497aeb796c48d26e69b9.ca...@j-davis.com
Discussion: https://postgr.es/m/df2efad0cae7c65180df8e5ebb709e5eb4f2a82b.ca...@j-davis.com
---
 doc/src/sgml/config.sgml                      | 17 +++++
 src/backend/commands/collationcmds.c          |  8 +--
 src/backend/commands/dbcommands.c             |  8 +--
 src/backend/utils/adt/pg_locale.c             | 64 +++++++++++++++++++
 src/backend/utils/misc/guc_tables.c           | 10 +++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/utils/pg_locale.h                 |  1 +
 .../regress/expected/collate.icu.utf8.out     | 10 ++-
 src/test/regress/sql/collate.icu.utf8.sql     |  6 +-
 9 files changed, 112 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 481f93cea1..78eae3ca65 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9784,6 +9784,23 @@ SET XML OPTION { DOCUMENT | CONTENT };
       </listitem>
      </varlistentry>
+     <varlistentry id="guc-icu-locale-validation" xreflabel="icu_locale_validation">
+      <term><varname>icu_locale_validation</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>icu_locale_validation</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Validation is performed on an ICU locale specified for a new collation
+        or database. If this parameter is set to <literal>true</literal>, an
+        error is raised for a validation failure; if set to
+        <literal>false</literal>, a warning is issued. The default is
+        <literal>false</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-default-text-search-config" xreflabel="default_text_search_config">
       <term><varname>default_text_search_config</varname> (<type>string</type>)
       <indexterm>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index bcddd1d536..90f7aabc88 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -284,12 +284,8 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 					colliculocale = langtag;
 				}
-				else
-				{
-					ereport(WARNING,
-							(errmsg("could not convert locale \"%s\" to language tag",
-									colliculocale)));
-				}
+
+				icu_validate_locale(colliculocale);
 			}
 #else
 			ereport(ERROR,
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 5935477f44..600b3a0f61 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1079,12 +1079,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 				dbiculocale = langtag;
 			}
-			else
-			{
-				ereport(WARNING,
-						(errmsg("could not convert locale \"%s\" to language tag",
-								dbiculocale)));
-			}
+
+			icu_validate_locale(dbiculocale);
 		}
 #else
 		ereport(ERROR,
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 7a5376dc75..b651b99707 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -88,6 +88,7 @@
 #define MAX_L10N_DATA 80
+extern bool icu_locale_validation;
 /* GUC settings */
 char *locale_messages;
@@ -2897,6 +2898,69 @@ icu_language_tag(const char *loc_str, bool noError)
 	return langtag;
 }
+/*
+ * Perform best-effort check that the locale is a valid one.
+ */
+void
+icu_validate_locale(const char *loc_str)
+{
+	UErrorCode status;
+	int elevel = icu_locale_validation ? ERROR : WARNING;
+	char *langtag = icu_language_tag(loc_str, true);
+	char lang[ULOC_LANG_CAPACITY];
+
+	/* check that it can be converted to a language tag */
+	if (langtag == NULL)
+	{
+		ereport(elevel,
+				(errmsg("could not convert locale \"%s\" to language tag",
+						loc_str)));
+		return;
+	}
+	pfree(langtag);
+
+	/* validate that we can extract the language */
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		ereport(elevel,
+				(errmsg("could not get language from locale \"%s\": %s",
+						loc_str, u_errorName(status))));
+		return;
+	}
+
+	/* check for special languages */
+	if (strcmp(lang, "") == 0 ||
+		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
+		strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return;
+
+	/* search for matching language within ICU */
+	for (int32_t i = 0; i < uloc_countAvailable(); i++)
+	{
+		const char *otherloc = uloc_getAvailable(i);
+		char otherlang[ULOC_LANG_CAPACITY];
+
+		status = U_ZERO_ERROR;
+		uloc_getLanguage(otherloc, otherlang, ULOC_LANG_CAPACITY, &status);
+		if (U_FAILURE(status))
+		{
+			ereport(elevel,
+					(errmsg("could not get language from locale \"%s\": %s",
+							loc_str, u_errorName(status))));
+			continue;
+		}
+
+		if (strcmp(lang, otherlang) == 0)
+			return;
+	}
+
+	ereport(elevel,
+			(errmsg("language \"%s\" of locale \"%s\" not found",
+					lang, loc_str)));
+}
+
 #endif /* USE_ICU */
 /*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1c0583fe26..1c63ed0d21 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -481,6 +481,7 @@ char *event_source;
 bool row_security;
 bool check_function_bodies = true;
+bool icu_locale_validation = false;
 /*
  * This GUC exists solely for backward compatibility, check its definition for
@@ -1586,6 +1587,15 @@ struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"icu_locale_validation", PGC_USERSET, CLIENT_CONN_LOCALE,
+			gettext_noop("Raise an error for invalid ICU locale strings."),
+			NULL
+		},
+		&icu_locale_validation,
+		false,
+		NULL, NULL, NULL
+	},
 	{
 		{"array_nulls", PGC_USERSET, COMPAT_OPTIONS_PREVIOUS,
 			gettext_noop("Enable input of NULL elements in arrays."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d06074b86f..cff927e8be 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -730,6 +730,8 @@
 #lc_numeric = 'C' # locale for number formatting
 #lc_time = 'C' # locale for time formatting
+#icu_locale_validation = off # validate ICU locale strings
+
 # default configuration for text search
 #default_text_search_config = 'pg_catalog.simple'
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index ae9077c9bc..076665dfc3 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -122,6 +122,7 @@ extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
 extern char *icu_language_tag(const char *loc_str, bool noError);
+extern void icu_validate_locale(const char *loc_str);
 #endif
 extern void check_icu_locale(const char *icu_locale);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 730decc4cb..5eeceb7e02 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1092,8 +1092,16 @@ $$;
 RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR: parameter "locale" must be specified
-CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */ DROP COLLATION testx;
+SET icu_locale_validation = true;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense'); -- fails
 NOTICE: using language tag "nonsense" for locale "nonsense"
+ERROR: language "nonsense" of locale "nonsense" not found
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+ERROR: could not convert locale "@colStrength=primary;nonsense=yes" to language tag
+RESET icu_locale_validation;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense'); DROP COLLATION testx;
+NOTICE: using language tag "nonsense" for locale "nonsense"
+WARNING: language "nonsense" of locale "nonsense" not found
 CREATE COLLATION test4 FROM nonsense;
 ERROR: collation "nonsense" for encoding "UTF8" does not exist
 CREATE COLLATION test5 FROM test0;
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 5f3a88a404..7d2c91252c 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -386,7 +386,11 @@ $$;
 RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
-CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */ DROP COLLATION testx;
+SET icu_locale_validation = true;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense'); -- fails
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+RESET icu_locale_validation;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense'); DROP COLLATION testx;
 CREATE COLLATION test4 FROM nonsense;
 CREATE COLLATION test5 FROM test0;
--
2.34.1
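To make the control flow of icu_validate_locale() in 0007 easier to follow, here is a rough, ICU-free sketch of its language check: pick the report level from the GUC, accept the special languages, and otherwise look for a matching language among the available locales. Everything here is an assumption for illustration only. The available[] list is a hypothetical stand-in for uloc_countAvailable()/uloc_getAvailable(), get_language() is a simplified stand-in for uloc_getLanguage() (it does not lowercase or canonicalize), and the strict flag plays the role of icu_locale_validation. The strict language-tag conversion that the patch performs first is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum report_level { LEVEL_NONE, LEVEL_WARNING, LEVEL_ERROR };

/* Hypothetical sample standing in for the locales ICU reports as available. */
static const char *const available[] = {"de", "de_DE", "en", "en_US", "fr"};

/*
 * Simplified stand-in for uloc_getLanguage(): copy everything before the
 * first '_', '-' or '@' into lang, truncating to fit.
 */
static void
get_language(const char *loc, char *lang, size_t cap)
{
	size_t		n = strcspn(loc, "_-@");

	assert(cap > 0);
	if (n >= cap)
		n = cap - 1;
	memcpy(lang, loc, n);
	lang[n] = '\0';
}

/*
 * Returns LEVEL_NONE if the language looks valid; otherwise the level at
 * which the problem would be reported (ERROR when strict, else WARNING).
 */
static enum report_level
validate_locale(const char *loc, bool strict)
{
	char		lang[16];
	char		other[16];

	get_language(loc, lang, sizeof(lang));

	/* special languages are accepted without a lookup */
	if (strcmp(lang, "") == 0 ||
		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
		strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
		return LEVEL_NONE;

	/* search for a locale with the same language */
	for (size_t i = 0; i < sizeof(available) / sizeof(available[0]); i++)
	{
		get_language(available[i], other, sizeof(other));
		if (strcmp(lang, other) == 0)
			return LEVEL_NONE;
	}

	return strict ? LEVEL_ERROR : LEVEL_WARNING;
}
```

Under these assumptions, validate_locale("nonsense", true) corresponds to the ERROR case in the new regression test, validate_locale("nonsense", false) to the WARNING case, and an attribute-only locale such as "@colStrength=secondary" passes because its language is empty.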