Re: ICU locale validation / canonicalization

Jeff Davis Wed, 22 Mar 2023 11:05:52 -0700

On Tue, 2023-03-21 at 10:35 +0100, Peter Eisentraut wrote:
> [PATCH v6 1/6] Support language tags in older ICU versions (53 and
>   earlier).
> 
> In pg_import_system_collations(), this is now redundant and can be 
> simplified:
> 
> -               if (!pg_is_ascii(langtag) || !pg_is_ascii(iculocstr))
> +               if (!pg_is_ascii(langtag) || !pg_is_ascii(langtag))
> 
> icu_set_collation_attributes() needs more commenting about what is
> going 
> on.  My guess is that uloc_canonicalize() converts from language tag
> to 
> ICU locale ID, and then the existing logic to parse that apart would 
> apply.  Is that how it works?


Fixed the redundancy, added some comments, and committed 0001.

> [PATCH v6 2/6] Wrap ICU ucol_open().
> 
> It makes sense to try to unify some of this.  But I find the naming 
> confusing.  If I see pg_ucol_open(), then I would expect that all
> calls 
> to ucol_open() would be replaced by this.  But here it's only a few, 
> without explanation.  (pg_ucol_open() has no explanation at all
> AFAICT.)

The remaining callsite which doesn't use the wrapper is in initdb.c,
which can't call into pg_locale.c, and has different intentions. initdb
uses ucol_open to get the default locale if icu_locale is not
specified; and it also uses ucol open to verify that the locale can be
opened (whether specified or the default). (Aside: I created a tiny
0004 patch which makes this difference more clear and adds a nice
comment.)

There's no reason to use a wrapper when getting the default locale,
because it's just passing NULL anyway.

When verifying that the locale can be opened, ucol_open() doesn't catch
many problems anyway, so I'm not sure it's worth a lot of effort to
copy these extra checks that the wrapper does into initdb.c. For
instance, what's the value in replacing "und" with "root" if opening
either will succeed? Parsing the attributes can potentially catch
problems, but the later patch 0006 will check the attributes when
converting to a language tag at initdb time.

So I'm inclined to just leave initdb alone in patches 0002 and 0003.

> I have in my notes that check_icu_locale() and make_icu_collator() 
> should be combined into a single function.  I think that would be a 
> better way to slice it.

That would leave out get_collation_actual_version(), which should
handle the same fixups for attributes and the "und" locale.

> Btw., I had intentionally not written code like this
> 
> +#if U_ICU_VERSION_MAJOR_NUM < 54
> +       icu_set_collation_attributes(collator, loc_str);
> +#endif
> 
> The disadvantage of doing it that way is that you then need to dig
> out 
> an old version of ICU in order to check whether the code compiles at 
> all.  With the current code, you can be sure that that code compiles
> if 
> you make changes elsewhere.

I was wondering about that -- thank you, I changed it back to use "if"
rather than "#ifdef".


New series attached (starting at 0002 to better correspond to the
previous series).


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

From fbe03dc596b5e12f4dda60269e044caa58f8be32 Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Tue, 14 Mar 2023 21:21:17 -0700
Subject: [PATCH v7 2/7] Wrap ICU ucol_open().

Hide details of supporting older ICU versions in a wrapper
function. The current code only needs to handle
icu_set_collation_attributes(), but a subsequent commit will add
additional version-specific code.
---
 src/backend/utils/adt/pg_locale.c | 70 +++++++++++++++++++------------
 1 file changed, 43 insertions(+), 27 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index c3ede994be..dd0786dff5 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -140,6 +140,7 @@ static char *IsoLocaleName(const char *);
  */
 static UConverter *icu_converter = NULL;
 
+static UCollator *pg_ucol_open(const char *loc_str);
 static void init_icu_converter(void);
 static size_t uchar_length(UConverter *converter,
 						   const char *str, int32_t len);
@@ -1430,17 +1431,8 @@ make_icu_collator(const char *iculocstr,
 {
 #ifdef USE_ICU
 	UCollator  *collator;
-	UErrorCode	status;
 
-	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
-
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, iculocstr);
+	collator = pg_ucol_open(iculocstr);
 
 	/*
 	 * If rules are specified, we extract the rules of the standard collation,
@@ -1451,6 +1443,7 @@ make_icu_collator(const char *iculocstr,
 		const UChar *default_rules;
 		UChar	   *agg_rules;
 		UChar	   *my_rules;
+		UErrorCode	status;
 		int32_t		length;
 
 		default_rules = ucol_getRules(collator, &length);
@@ -1722,16 +1715,11 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	if (collprovider == COLLPROVIDER_ICU)
 	{
 		UCollator  *collator;
-		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
 
-		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
-		if (U_FAILURE(status))
-			ereport(ERROR,
-					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
+		collator = pg_ucol_open(collcollate);
+
 		ucol_getVersion(collator, versioninfo);
 		ucol_close(collator);
 
@@ -2505,6 +2493,43 @@ pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 }
 
 #ifdef USE_ICU
+
+/*
+ * Wrapper around ucol_open() to handle API differences for older ICU
+ * versions.
+ */
+static UCollator *
+pg_ucol_open(const char *loc_str)
+{
+	UCollator  *collator;
+	UErrorCode	status;
+
+	/*
+	 * Must never open default collator, because it depends on the environment
+	 * and may change at any time.
+	 *
+	 * NB: the default collator is not the same as the collator for the root
+	 * locale. The root locale may be specified as the empty string, "und", or
+	 * "root". The default collator is opened by passing NULL to ucol_open().
+	 */
+	if (loc_str == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("opening default collator is not supported")));
+
+	status = U_ZERO_ERROR;
+	collator = ucol_open(loc_str, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("could not open collator for locale \"%s\": %s",
+						loc_str, u_errorName(status))));
+
+	if (U_ICU_VERSION_MAJOR_NUM < 54)
+		icu_set_collation_attributes(collator, loc_str);
+
+	return collator;
+}
+
 static void
 init_icu_converter(void)
 {
@@ -2771,17 +2796,8 @@ check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
 	UCollator  *collator;
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, icu_locale);
+	collator = pg_ucol_open(icu_locale);
 	ucol_close(collator);
 #else
 	ereport(ERROR,
-- 
2.34.1

From 8027572146571609927815d0fe14f761fc86cf2c Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Tue, 14 Mar 2023 22:28:21 -0700
Subject: [PATCH v7 3/7] Handle the "und" locale in ICU versions 54 and older.

The "und" locale is an alternative spelling of the root locale, but it
was not recognized until ICU 55. To maintain common behavior across
all supported ICU versions, check for "und" and replace with "root"
before opening.

Previously, the lack of support for "und" was dangerous, because
versions 54 and older fall back to the environment when a locale is
not found. If the user specified "und" for the language (which is
expected and documented), it could not only resolve to the wrong
collator, but it could unexpectedly change (which could lead to
corrupt indexes).

This effectively reverts commit d72900bded, which worked around the
problem for the built-in "unicode" collation, and is no longer
necessary.

Discussion: https://postgr.es/m/60da0cecfb512a78b8666b31631a636215d8ce73.ca...@j-davis.com
Discussion: https://postgr.es/m/0c6fa66f2753217d2a40480a96bd2ccf023536a1.ca...@j-davis.com
---
 src/backend/utils/adt/pg_locale.c             | 38 ++++++++++++++++++-
 src/bin/initdb/initdb.c                       |  2 +-
 .../regress/expected/collate.icu.utf8.out     |  7 ++++
 src/test/regress/sql/collate.icu.utf8.sql     |  2 +
 4 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index dd0786dff5..052db11413 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2501,8 +2501,9 @@ pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 static UCollator *
 pg_ucol_open(const char *loc_str)
 {
-	UCollator  *collator;
-	UErrorCode	status;
+	UCollator	*collator;
+	UErrorCode	 status;
+	char		*fixed_str = NULL;
 
 	/*
 	 * Must never open default collator, because it depends on the environment
@@ -2517,6 +2518,36 @@ pg_ucol_open(const char *loc_str)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("opening default collator is not supported")));
 
+	/*
+	 * In ICU versions 54 and earlier, "und" is not a recognized spelling of
+	 * the root locale. If the first component of the locale is "und", replace
+	 * with "root" before opening.
+	 */
+	if (U_ICU_VERSION_MAJOR_NUM < 55)
+	{
+		char  lang[ULOC_LANG_CAPACITY];
+
+		status = U_ZERO_ERROR;
+		uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+		if (U_FAILURE(status))
+		{
+			ereport(ERROR,
+					(errmsg("could not get language from locale \"%s\": %s",
+							loc_str, u_errorName(status))));
+		}
+
+		if (strcmp(lang, "und") == 0)
+		{
+			const char *remainder = loc_str + strlen("und");
+
+			fixed_str = palloc(strlen("root") + strlen(remainder) + 1);
+			strcpy(fixed_str, "root");
+			strcat(fixed_str, remainder);
+
+			loc_str = fixed_str;
+		}
+	}
+
 	status = U_ZERO_ERROR;
 	collator = ucol_open(loc_str, &status);
 	if (U_FAILURE(status))
@@ -2527,6 +2558,9 @@ pg_ucol_open(const char *loc_str)
 	if (U_ICU_VERSION_MAJOR_NUM < 54)
 		icu_set_collation_attributes(collator, loc_str);
 
+	if (fixed_str != NULL)
+		pfree(fixed_str);
+
 	return collator;
 }
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 68d430ed63..d48b7b6060 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1498,7 +1498,7 @@ setup_collation(FILE *cmdfd)
 	 * that they win if libc defines a locale with the same name.
 	 */
 	PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, collowner, collprovider, collisdeterministic, collencoding, colliculocale)"
-				  "VALUES (pg_nextoid('pg_catalog.pg_collation', 'oid', 'pg_catalog.pg_collation_oid_index'), 'unicode', 'pg_catalog'::regnamespace, %u, '%c', true, -1, '');\n\n",
+				  "VALUES (pg_nextoid('pg_catalog.pg_collation', 'oid', 'pg_catalog.pg_collation_oid_index'), 'unicode', 'pg_catalog'::regnamespace, %u, '%c', true, -1, 'und');\n\n",
 				  BOOTSTRAP_SUPERUSERID, COLLPROVIDER_ICU);
 
 	PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, collowner, collprovider, collisdeterministic, collencoding, collcollate, collctype)"
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 6225b575ce..f135200c99 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1312,6 +1312,13 @@ SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
  t
 (1 row)
 
+CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
 CREATE TABLE test3cs (x text COLLATE case_sensitive);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 64cbfd0a5b..8105ebc8ae 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -521,6 +521,8 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
 -- test language tags
 CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
 SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
+CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
 
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
-- 
2.34.1

From 537a53c0504b655a3d91156bfff3d5effcf21e06 Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Wed, 15 Mar 2023 11:27:12 -0700
Subject: [PATCH v7 4/7] Accept C/POSIX locales when converting to language
 tag.

Account for locales "C" and "POSIX", which correspond to the
language tag "en-US-u-va-posix".

Add a SQL function pg_language_tag() that performs the conversion.

Also, don't rely on a fixed-size buffer for language tags, as there is
no defined upper limit (cf. RFC 5646 section 4.4).
---
 doc/src/sgml/func.sgml                        | 15 ++++
 src/backend/commands/collationcmds.c          | 44 ++++++------
 src/backend/utils/adt/pg_locale.c             | 68 +++++++++++++++++++
 src/bin/pg_dump/t/002_pg_dump.pl              |  4 +-
 src/include/catalog/catversion.h              |  2 +-
 src/include/catalog/pg_proc.dat               |  5 ++
 src/include/utils/pg_locale.h                 |  1 +
 .../regress/expected/collate.icu.utf8.out     | 55 +++++++++++++++
 src/test/regress/sql/collate.icu.utf8.sql     | 10 +++
 9 files changed, 180 insertions(+), 24 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index a3a13b895f..35cecc24c8 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27530,6 +27530,21 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
         Use of this function is restricted to superusers.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_language_tag</primary>
+        </indexterm>
+        <function>pg_language_tag</function> ( <parameter>locale</parameter> <type>text</type> )
+        <returnvalue>text</returnvalue>
+       </para>
+       <para>
+        Canonicalizes the given <parameter>locale</parameter> string into a
+        BCP 47 language tag (see <xref
+        linkend="collation-managing-create-icu"/>).
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 3d0aea0568..ca1d46669f 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -576,26 +576,6 @@ cmpaliases(const void *a, const void *b)
 
 
 #ifdef USE_ICU
-/*
- * Get the ICU language tag for a locale name.
- * The result is a palloc'd string.
- */
-static char *
-get_icu_language_tag(const char *localename)
-{
-	char		buf[ULOC_FULLNAME_CAPACITY];
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
-
-	return pstrdup(buf);
-}
-
 /*
  * Get a comment (specifically, the display name) for an ICU locale.
  * The result is a palloc'd string, or NULL if we can't get a comment
@@ -957,7 +937,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			else
 				name = uloc_getAvailable(i);
 
-			langtag = get_icu_language_tag(name);
+			langtag = icu_language_tag(name, false);
 
 			/*
 			 * Be paranoid about not allowing any non-ASCII strings into
@@ -1014,3 +994,25 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 
 	PG_RETURN_INT32(ncreated);
 }
+
+/*
+ * pg_language_tag
+ *
+ * Return the BCP47 language tag representation of the given locale string.
+ */
+Datum
+pg_language_tag(PG_FUNCTION_ARGS)
+{
+#ifdef USE_ICU
+	text	*locale_text = PG_GETARG_TEXT_PP(0);
+	char	*locale_cstr = text_to_cstring(locale_text);
+	char	*langtag	 = icu_language_tag(locale_cstr, false);
+
+	PG_RETURN_TEXT_P(cstring_to_text(langtag));
+#else
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("ICU is not supported in this build")));
+	PG_RETURN_NULL();
+#endif
+}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 052db11413..baafc71a3d 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2840,6 +2840,74 @@ check_icu_locale(const char *icu_locale)
 #endif
 }
 
+#ifdef USE_ICU
+/*
+ * Return the BCP47 language tag representation of the requested locale.
+ *
+ * This function should be called before passing the string to ucol_open(),
+ * because conversion to a language tag also performs "level 2
+ * canonicalization". In addition to producing a consistent format, level 2
+ * canonicalization is able to more accurately interpret different input
+ * locale string formats, such as POSIX and .NET IDs.
+ */
+char *
+icu_language_tag(const char *loc_str, bool noError)
+{
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	char		*langtag;
+	size_t		 buflen = 32;	/* arbitrary starting buffer size */
+	const bool	 strict = true;
+
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		ereport(ERROR,
+				(errmsg("could not get language from locale \"%s\": %s",
+						loc_str, u_errorName(status))));
+	}
+
+	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
+	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return pstrdup("en-US-u-va-posix");
+
+	/*
+	 * A BCP47 language tag doesn't have a clearly-defined upper limit
+	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+	 * uloc_toLanguageTag() doesn't always return the ultimate length on the
+	 * first call, necessitating a loop.
+	 */
+	langtag = palloc(buflen);
+	while (true)
+	{
+		int32_t		len;
+
+		status = U_ZERO_ERROR;
+		len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+		if (len < buflen || buflen >= MaxAllocSize)
+			break;
+
+		buflen = Min(buflen * 2, MaxAllocSize);
+		langtag = repalloc(langtag, buflen);
+	}
+
+	if (U_FAILURE(status))
+	{
+		pfree(langtag);
+		if (noError)
+			return NULL;
+
+		ereport(ERROR,
+				(errmsg("could not convert locale name \"%s\" to language tag: %s",
+						loc_str, u_errorName(status))));
+	}
+
+	return langtag;
+}
+
+#endif							/* USE_ICU */
+
 /*
  * These functions convert from/to libc's wchar_t, *not* pg_wchar_t.
  * Therefore we keep them here rather than with the mbutils code.
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index a22f27f300..0b38c0537b 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1837,9 +1837,9 @@ my %tests = (
 
 	'CREATE COLLATION icu_collation' => {
 		create_order => 76,
-		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'C');",
+		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'en-US-u-va-posix');",
 		regexp =>
-		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'C'(, version = '[^']*')?\);/m,
+		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'en-US-u-va-posix'(, version = '[^']*')?\);/m,
 		icu => 1,
 		like      => { %full_runs, section_pre_data => 1, },
 	},
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index e94528a7c7..d993539dfe 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202303181
+#define CATALOG_VERSION_NO	202303211
 
 #endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5cf87aeb2c..43db94557d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11838,6 +11838,11 @@
   proname => 'pg_database_collation_actual_version', procost => '100',
   provolatile => 'v', prorettype => 'text', proargtypes => 'oid',
   prosrc => 'pg_database_collation_actual_version' },
+{ oid => '6273',
+  descr => 'get BCP47 language tag representation of locale',
+  proname => 'pg_language_tag', procost => '100',
+  provolatile => 's', prorettype => 'text', proargtypes => 'text',
+  prosrc => 'pg_language_tag' },
 
 # system management/monitoring related functions
 { oid => '3353', descr => 'list files in the log directory',
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index dd822a68be..ae9077c9bc 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -121,6 +121,7 @@ extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+extern char *icu_language_tag(const char *loc_str, bool noError);
 #endif
 extern void check_icu_locale(const char *icu_locale);
 
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index f135200c99..d8e6240cd7 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -11,6 +11,61 @@ SELECT getdatabaseencoding() <> 'UTF8' OR
 SET client_encoding TO UTF8;
 CREATE SCHEMA collate_tests;
 SET search_path = collate_tests;
+-- test language tag canonicalization
+SELECT pg_language_tag('en_US');
+ pg_language_tag 
+-----------------
+ en-US
+(1 row)
+
+SELECT pg_language_tag('nonsense');
+ pg_language_tag 
+-----------------
+ nonsense
+(1 row)
+
+SELECT pg_language_tag('C.UTF-8');
+ pg_language_tag  
+------------------
+ en-US-u-va-posix
+(1 row)
+
+SELECT pg_language_tag('POSIX');
+ pg_language_tag  
+------------------
+ en-US-u-va-posix
+(1 row)
+
+SELECT pg_language_tag('en_US_POSIX');
+ pg_language_tag  
+------------------
+ en-US-u-va-posix
+(1 row)
+
+SELECT pg_language_tag('@colStrength=secondary');
+ pg_language_tag 
+-----------------
+ und-u-ks-level2
+(1 row)
+
+SELECT pg_language_tag('');
+ pg_language_tag 
+-----------------
+ und
+(1 row)
+
+SELECT pg_language_tag('fr_CA.UTF-8');
+ pg_language_tag 
+-----------------
+ fr-CA
+(1 row)
+
+SELECT pg_language_tag('en_US@colStrength=primary');
+  pg_language_tag  
+-------------------
+ en-US-u-ks-level1
+(1 row)
+
 CREATE TABLE collate_test1 (
     a int,
     b text COLLATE "en-x-icu" NOT NULL
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 8105ebc8ae..c7241c739a 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -15,6 +15,16 @@ SET client_encoding TO UTF8;
 CREATE SCHEMA collate_tests;
 SET search_path = collate_tests;
 
+-- test language tag canonicalization
+SELECT pg_language_tag('en_US');
+SELECT pg_language_tag('nonsense');
+SELECT pg_language_tag('C.UTF-8');
+SELECT pg_language_tag('POSIX');
+SELECT pg_language_tag('en_US_POSIX');
+SELECT pg_language_tag('@colStrength=secondary');
+SELECT pg_language_tag('');
+SELECT pg_language_tag('fr_CA.UTF-8');
+SELECT pg_language_tag('en_US@colStrength=primary');
 
 CREATE TABLE collate_test1 (
     a int,
-- 
2.34.1

From c39c714fecd2fcf304e2fc8a120b9f6851cd78db Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Wed, 22 Mar 2023 10:06:23 -0700
Subject: [PATCH v7 5/7] initdb: emit message when using default ICU locale.

Also, minor cleanup to separate the code that chooses the default ICU
locale from the code that verifies that a specified locale can be
opened with ucol_open(). This cleanup creates a better place for an
important comment.
---
 src/bin/initdb/initdb.c | 77 ++++++++++++++++++++++++++++-------------
 1 file changed, 52 insertions(+), 25 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index d48b7b6060..7f857f6075 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2039,46 +2039,73 @@ check_icu_locale_encoding(int user_enc)
 	return true;
 }
 
+#ifdef USE_ICU
+
 /*
- * Check that ICU accepts the locale name; or if not specified, retrieve the
- * default ICU locale.
+ * Determine default ICU locale by opening the default collator and reading
+ * its locale.
+ *
+ * NB: The default collator (opened using NULL) is different from the collator
+ * for the root locale (opened with "", "und", or "root"). The former depends
+ * on the environment (useful at initdb time) and the latter does not.
  */
-static void
-check_icu_locale(void)
+static char *
+default_icu_locale(void)
 {
-#ifdef USE_ICU
 	UCollator	*collator;
 	UErrorCode   status;
+	const char	*valid_locale;
+	char		*default_locale;
 
 	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
+	collator = ucol_open(NULL, &status);
+	if (U_FAILURE(status))
+		pg_fatal("could not open collator for default locale: %s",
+				 u_errorName(status));
+
+	status = U_ZERO_ERROR;
+	valid_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
+										&status);
 	if (U_FAILURE(status))
 	{
-		if (icu_locale)
-			pg_fatal("could not open collator for locale \"%s\": %s",
-					 icu_locale, u_errorName(status));
-		else
-			pg_fatal("could not open collator for default locale: %s",
-					 u_errorName(status));
+		ucol_close(collator);
+		pg_fatal("could not determine default ICU locale");
 	}
 
-	/* if not specified, get locale from default collator */
-	if (icu_locale == NULL)
-	{
-		const char	*default_locale;
+	default_locale = pg_strdup(valid_locale);
 
-		status = U_ZERO_ERROR;
-		default_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
-											  &status);
-		if (U_FAILURE(status))
-		{
-			ucol_close(collator);
-			pg_fatal("could not determine default ICU locale");
-		}
+	ucol_close(collator);
 
-		icu_locale = pg_strdup(default_locale);
+	return default_locale;
+}
+
+#endif
+
+/*
+ * If not specified, assign the default locale. Then check that ICU accepts
+ * the locale.
+ */
+static void
+check_icu_locale(void)
+{
+#ifdef USE_ICU
+	UCollator	*collator;
+	UErrorCode   status;
+
+	/* acquire default locale from the environment, if not specified */
+	if (icu_locale == NULL)
+	{
+		icu_locale = default_icu_locale();
+		printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
 	}
 
+	/* check that the resulting locale can be opened */
+	status = U_ZERO_ERROR;
+	collator = ucol_open(icu_locale, &status);
+	if (U_FAILURE(status))
+		pg_fatal("could not open collator for locale \"%s\": %s",
+				 icu_locale, u_errorName(status));
+
 	ucol_close(collator);
 #endif
 }
-- 
2.34.1

From ba447ed36dc028e9fb1a0a63d392531223bb8ffc Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Wed, 15 Mar 2023 12:37:06 -0700
Subject: [PATCH v7 6/7] Canonicalize ICU locale names to language tags.

Convert to BCP47 language tags before storing in the catalog, except
during binary upgrade or when the locale comes from an existing
collation or template database.

Canonicalization is important, because it's able to handle more kinds
of locale strings than ucol_open(). Without canonicalizing first, a
locale string like "fr_CA.UTF-8" will be misinterpreted by
ucol_open().

The resulting language tags can vary slightly between ICU
versions. For instance, "@colBackwards=yes" is converted to
"und-u-kb-true" in older versions of ICU, and to the simpler (but
equivalent) "und-u-kb" in newer versions.

Discussion: https://postgr.es/m/8c7af6820aed94dc7bc259d2aa7f9663518e6137.ca...@j-davis.com
---
 src/backend/commands/collationcmds.c          | 38 +++++++++++
 src/backend/commands/dbcommands.c             | 32 +++++++++
 src/backend/utils/adt/pg_locale.c             |  9 ---
 src/bin/initdb/initdb.c                       | 68 ++++++++++++++++++-
 .../regress/expected/collate.icu.utf8.out     | 25 ++++++-
 src/test/regress/sql/collate.icu.utf8.sql     | 13 ++++
 6 files changed, 173 insertions(+), 12 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index ca1d46669f..bcddd1d536 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -165,6 +165,11 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		else
 			colliculocale = NULL;
 
+		/*
+		 * When the ICU locale comes from an existing collation, do not
+		 * canonicalize to a language tag.
+		 */
+
 		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collicurules, &isnull);
 		if (!isnull)
 			collicurules = TextDatumGetCString(datum);
@@ -254,10 +259,43 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		}
 		else if (collprovider == COLLPROVIDER_ICU)
 		{
+#ifdef USE_ICU
+			char	*langtag;
+
 			if (!colliculocale)
 				ereport(ERROR,
 						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 						 errmsg("parameter \"locale\" must be specified")));
+
+			check_icu_locale(colliculocale);
+
+			/*
+			 * During binary upgrade, preserve the locale string. Otherwise,
+			 * canonicalize to a language tag.
+			 */
+			if (!IsBinaryUpgrade)
+			{
+				langtag = icu_language_tag(colliculocale, true);
+				if (langtag)
+				{
+					ereport(NOTICE,
+							(errmsg("using language tag \"%s\" for locale \"%s\"",
+									langtag, colliculocale)));
+
+					colliculocale = langtag;
+				}
+				else
+				{
+					ereport(WARNING,
+							(errmsg("could not convert locale \"%s\" to language tag",
+									colliculocale)));
+				}
+			}
+#else
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("ICU is not supported in this build")));
+#endif
 		}
 
 		/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 4d5d5d6866..5935477f44 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1043,6 +1043,9 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 
 	if (dblocprovider == COLLPROVIDER_ICU)
 	{
+#ifdef USE_ICU
+		char	*langtag;
+
 		if (!(is_encoding_supported_by_icu(encoding)))
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -1059,6 +1062,35 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 					 errmsg("ICU locale must be specified")));
 
 		check_icu_locale(dbiculocale);
+
+		/*
+		 * During binary upgrade, or when the locale came from the template
+		 * database, preserve locale string. Otherwise, canonicalize to a
+		 * language tag.
+		 */
+		if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
+		{
+			langtag = icu_language_tag(dbiculocale, true);
+			if (langtag)
+			{
+				ereport(NOTICE,
+						(errmsg("using language tag \"%s\" for locale \"%s\"",
+								langtag, dbiculocale)));
+
+				dbiculocale = langtag;
+			}
+			else
+			{
+				ereport(WARNING,
+						(errmsg("could not convert locale \"%s\" to language tag",
+								dbiculocale)));
+			}
+		}
+#else
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("ICU is not supported in this build")));
+#endif
 	}
 	else
 	{
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index baafc71a3d..7a5376dc75 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2820,27 +2820,18 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 	pfree(lower_str);
 }
 
-#endif							/* USE_ICU */
-
 /*
  * Check if the given locale ID is valid, and ereport(ERROR) if it isn't.
  */
 void
 check_icu_locale(const char *icu_locale)
 {
-#ifdef USE_ICU
 	UCollator  *collator;
 
 	collator = pg_ucol_open(icu_locale);
 	ucol_close(collator);
-#else
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ICU is not supported in this build")));
-#endif
 }
 
-#ifdef USE_ICU
 /*
  * Return the BCP47 language tag representation of the requested locale.
  *
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 7f857f6075..609bd9ea1e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2079,11 +2079,67 @@ default_icu_locale(void)
 	return default_locale;
 }
 
+/*
+ * Convert to canonical BCP47 language tag. Must be consistent with
+ * icu_language_tag().
+ */
+static char *
+icu_language_tag(const char *loc_str)
+{
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	char		*langtag;
+	size_t		 buflen = 32;	/* arbitrary starting buffer size */
+	const bool	 strict = true;
+
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		pg_fatal("could not get language from locale \"%s\": %s",
+				 loc_str, u_errorName(status));
+	}
+
+	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
+	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return pstrdup("en-US-u-va-posix");
+
+	/*
+	 * A BCP47 language tag doesn't have a clearly-defined upper limit
+	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+	 * uloc_toLanguageTag() doesn't always return the ultimate length on the
+	 * first call, necessitating a loop.
+	 */
+	langtag = pg_malloc(buflen);
+	while (true)
+	{
+		int32_t		len;
+
+		status = U_ZERO_ERROR;
+		len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+		if (len < buflen)
+			break;
+
+		buflen = buflen * 2;
+		langtag = pg_realloc(langtag, buflen);
+	}
+
+	if (U_FAILURE(status))
+	{
+		pg_free(langtag);
+
+		pg_fatal("could not convert locale name \"%s\" to language tag: %s",
+				 loc_str, u_errorName(status));
+	}
+
+	return langtag;
+}
+
 #endif
 
 /*
- * If not specified, assign the default locale. Then check that ICU accepts
- * the locale.
+ * If not specified, assign the default locale. Then convert to a language
+ * tag, and check that ICU accepts it.
  */
 static void
 check_icu_locale(void)
@@ -2091,6 +2147,7 @@ check_icu_locale(void)
 #ifdef USE_ICU
 	UCollator	*collator;
 	UErrorCode   status;
+	char		*langtag;
 
 	/* acquire default locale from the environment, if not specified */
 	if (icu_locale == NULL)
@@ -2099,6 +2156,13 @@ check_icu_locale(void)
 		printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
 	}
 
+	/* canonicalize to a language tag */
+	langtag = icu_language_tag(icu_locale);
+	printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
+		   langtag, icu_locale);
+	pg_free(icu_locale);
+	icu_locale = langtag;
+
 	/* check that the resulting locale can be opened */
 	status = U_ZERO_ERROR;
 	collator = ucol_open(icu_locale, &status);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index d8e6240cd7..730decc4cb 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1074,6 +1074,7 @@ reset enable_seqscan;
 CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -1088,9 +1089,11 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
 CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+NOTICE:  using language tag "nonsense" for locale "nonsense"
 CREATE COLLATION test4 FROM nonsense;
 ERROR:  collation "nonsense" for encoding "UTF8" does not exist
 CREATE COLLATION test5 FROM test0;
@@ -1217,14 +1220,18 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test ICU collation customization
 -- test the attributes handled by icu_set_collation_attributes()
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
  ?column? | ?column? 
 ----------+----------
@@ -1232,7 +1239,9 @@ SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll
 (1 row)
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
+NOTICE:  using language tag "und-u-kf-lower" for locale "@colCaseFirst=lower"
 CREATE COLLATION testcoll_upper_first (provider = icu, locale = '@colCaseFirst=upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "@colCaseFirst=upper"
 SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcoll_upper_first;
  ?column? | ?column? 
 ----------+----------
@@ -1240,13 +1249,16 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 (1 row)
 
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
+NOTICE:  using language tag "und-u-ka-shifted" for locale "@colAlternate=shifted"
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
  ?column? | ?column? 
 ----------+----------
@@ -1258,6 +1270,7 @@ ERROR:  could not open collator for locale "@colNumeric=lower": U_ILLEGAL_ARGUME
 -- test that attributes not handled by icu_set_collation_attributes()
 -- (handled by ucol_open() directly) also work
 CREATE COLLATION testcoll_de_phonebook (provider = icu, locale = 'de@collation=phonebook');
+NOTICE:  using language tag "de-u-co-phonebk" for locale "de@collation=phonebook"
 SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE testcoll_de_phonebook;
  ?column? | ?column? 
 ----------+----------
@@ -1266,6 +1279,7 @@ SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE tes
 
 -- rules
 CREATE COLLATION testcoll_rules1 (provider = icu, locale = '', rules = '&a < g');
+NOTICE:  using language tag "und" for locale ""
 CREATE TABLE test7 (a text);
 -- example from https://unicode-org.github.io/icu/userguide/collation/customization/#syntax
 INSERT INTO test7 VALUES ('Abernathy'), ('apple'), ('bird'), ('Boston'), ('Graham'), ('green');
@@ -1293,10 +1307,13 @@ SELECT * FROM test7 ORDER BY a COLLATE testcoll_rules1;
 
 DROP TABLE test7;
 CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!!');
-ERROR:  could not open collator for locale "" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
+NOTICE:  using language tag "und" for locale ""
+ERROR:  could not open collator for locale "und" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
 -- nondeterministic collations
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
+NOTICE:  using language tag "und" for locale ""
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1346,7 +1363,9 @@ SELECT * FROM test6a WHERE b = ARRAY['äbc'] COLLATE ctest_nondet;
 (2 rows)
 
 CREATE COLLATION case_sensitive (provider = icu, locale = '');
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=secondary', deterministic = false);
+NOTICE:  using language tag "und-u-ks-level2" for locale "@colStrength=secondary"
 SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
  ?column? | ?column? 
 ----------+----------
@@ -1361,6 +1380,7 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
 
 -- test language tags
 CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
+NOTICE:  using language tag "en-u-ks-level1" for locale "en-u-ks-level1"
 SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
  ?column? 
 ----------
@@ -1368,6 +1388,7 @@ SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
 (1 row)
 
 CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "und-u-kf-upper"
 SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
  ?column? 
 ----------
@@ -1828,7 +1849,9 @@ SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 (2 rows)
 
 -- accents
+SET client_min_messages=WARNING;
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+RESET client_min_messages;
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
 SELECT * FROM test4 WHERE b = 'cote';
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index c7241c739a..5f3a88a404 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -367,6 +367,8 @@ CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
+
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -380,6 +382,9 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+
+RESET client_min_messages;
+
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
 
@@ -464,10 +469,14 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test the attributes handled by icu_set_collation_attributes()
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
@@ -477,7 +486,9 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
 
 CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
@@ -666,7 +677,9 @@ INSERT INTO inner_text VALUES ('a', NULL);
 SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 
 -- accents
+SET client_min_messages=WARNING;
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+RESET client_min_messages;
 
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
-- 
2.34.1

From 9c2929ef42f9d3c53eaf00c4390b087594475740 Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Fri, 17 Mar 2023 09:55:31 -0700
Subject: [PATCH v7 7/7] Validate ICU locales.

Ensure that it can be transformed into a language tag in "strict" mode
(which validates the attributes), and also that the language exists in
ICU.

Basic validation helps avoid minor mistakes and misspellings, which
often fall back to the root locale instead of the intended
locale. It's even more important in ICU versions 54 and earlier, where
the same (misspelled) locale string could fall back to different
locales depending on the environment.

Discussion: https://postgr.es/m/11b1eeb7e7667fdd4178497aeb796c48d26e69b9.ca...@j-davis.com
Discussion: https://postgr.es/m/df2efad0cae7c65180df8e5ebb709e5eb4f2a82b.ca...@j-davis.com
---
 doc/src/sgml/config.sgml                      | 17 +++++
 src/backend/commands/collationcmds.c          |  8 +--
 src/backend/commands/dbcommands.c             |  8 +--
 src/backend/utils/adt/pg_locale.c             | 64 +++++++++++++++++++
 src/backend/utils/misc/guc_tables.c           | 10 +++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/utils/pg_locale.h                 |  1 +
 .../regress/expected/collate.icu.utf8.out     | 10 ++-
 src/test/regress/sql/collate.icu.utf8.sql     |  6 +-
 9 files changed, 112 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 481f93cea1..78eae3ca65 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9784,6 +9784,23 @@ SET XML OPTION { DOCUMENT | CONTENT };
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-icu-locale-validation" xreflabel="icu_locale_validation">
+      <term><varname>icu_locale_validation</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>icu_locale_validation</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Validation is performed on an ICU locale specified for a new collation
+        or database. If this parameter is set to <literal>true</literal>, an
+        error is raised for a validation failure; if set to
+        <literal>false</literal>, a warning is issued. The default is
+        <literal>false</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-default-text-search-config" xreflabel="default_text_search_config">
       <term><varname>default_text_search_config</varname> (<type>string</type>)
       <indexterm>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index bcddd1d536..90f7aabc88 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -284,12 +284,8 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 
 					colliculocale = langtag;
 				}
-				else
-				{
-					ereport(WARNING,
-							(errmsg("could not convert locale \"%s\" to language tag",
-									colliculocale)));
-				}
+
+				icu_validate_locale(colliculocale);
 			}
 #else
 			ereport(ERROR,
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 5935477f44..600b3a0f61 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1079,12 +1079,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 
 				dbiculocale = langtag;
 			}
-			else
-			{
-				ereport(WARNING,
-						(errmsg("could not convert locale \"%s\" to language tag",
-								dbiculocale)));
-			}
+
+			icu_validate_locale(dbiculocale);
 		}
 #else
 		ereport(ERROR,
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 7a5376dc75..b651b99707 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -88,6 +88,7 @@
 
 #define		MAX_L10N_DATA		80
 
+extern bool icu_locale_validation;
 
 /* GUC settings */
 char	   *locale_messages;
@@ -2897,6 +2898,69 @@ icu_language_tag(const char *loc_str, bool noError)
 	return langtag;
 }
 
+/*
+ * Perform best-effort check that the locale is a valid one.
+ */
+void
+icu_validate_locale(const char *loc_str)
+{
+	UErrorCode	 status;
+	int			 elevel	 = icu_locale_validation ? ERROR : WARNING;
+	char		*langtag = icu_language_tag(loc_str, true);
+	char		 lang[ULOC_LANG_CAPACITY];
+
+	/* check that it can be converted to a language tag */
+	if (langtag == NULL)
+	{
+		ereport(elevel,
+				(errmsg("could not convert locale \"%s\" to language tag",
+						loc_str)));
+		return;
+	}
+	pfree(langtag);
+
+	/* validate that we can extract the language */
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		ereport(elevel,
+				(errmsg("could not get language from locale \"%s\": %s",
+						loc_str, u_errorName(status))));
+		return;
+	}
+
+	/* check for special languages */
+	if (strcmp(lang, "") == 0 ||
+		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
+		strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return;
+
+	/* search for matching language within ICU */
+	for (int32_t i = 0; i < uloc_countAvailable(); i++)
+	{
+		const char	*otherloc = uloc_getAvailable(i);
+		char		 otherlang[ULOC_LANG_CAPACITY];
+
+		status = U_ZERO_ERROR;
+		uloc_getLanguage(otherloc, otherlang, ULOC_LANG_CAPACITY, &status);
+		if (U_FAILURE(status))
+		{
+			ereport(elevel,
+					(errmsg("could not get language from locale \"%s\": %s",
+							loc_str, u_errorName(status))));
+			continue;
+		}
+
+		if (strcmp(lang, otherlang) == 0)
+			return;
+	}
+
+	ereport(elevel,
+			(errmsg("language \"%s\" of locale \"%s\" not found",
+					lang, loc_str)));
+}
+
 #endif							/* USE_ICU */
 
 /*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1c0583fe26..1c63ed0d21 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -481,6 +481,7 @@ char	   *event_source;
 
 bool		row_security;
 bool		check_function_bodies = true;
+bool		icu_locale_validation = false;
 
 /*
  * This GUC exists solely for backward compatibility, check its definition for
@@ -1586,6 +1587,15 @@ struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"icu_locale_validation", PGC_USERSET, CLIENT_CONN_LOCALE,
+		 gettext_noop("Raise an error for invalid ICU locale strings."),
+		 NULL
+		},
+		&icu_locale_validation,
+		false,
+		NULL, NULL, NULL
+	},
 	{
 		{"array_nulls", PGC_USERSET, COMPAT_OPTIONS_PREVIOUS,
 			gettext_noop("Enable input of NULL elements in arrays."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d06074b86f..cff927e8be 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -730,6 +730,8 @@
 #lc_numeric = 'C'			# locale for number formatting
 #lc_time = 'C'				# locale for time formatting
 
+#icu_locale_validation = off		# validate ICU locale strings
+
 # default configuration for text search
 #default_text_search_config = 'pg_catalog.simple'
 
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index ae9077c9bc..076665dfc3 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -122,6 +122,7 @@ extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
 extern char *icu_language_tag(const char *loc_str, bool noError);
+extern void icu_validate_locale(const char *loc_str);
 #endif
 extern void check_icu_locale(const char *icu_locale);
 
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 730decc4cb..5eeceb7e02 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1092,8 +1092,16 @@ $$;
 RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
-CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+SET icu_locale_validation = true;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense'); -- fails
 NOTICE:  using language tag "nonsense" for locale "nonsense"
+ERROR:  language "nonsense" of locale "nonsense" not found
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+ERROR:  could not convert locale "@colStrength=primary;nonsense=yes" to language tag
+RESET icu_locale_validation;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense'); DROP COLLATION testx;
+NOTICE:  using language tag "nonsense" for locale "nonsense"
+WARNING:  language "nonsense" of locale "nonsense" not found
 CREATE COLLATION test4 FROM nonsense;
 ERROR:  collation "nonsense" for encoding "UTF8" does not exist
 CREATE COLLATION test5 FROM test0;
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 5f3a88a404..7d2c91252c 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -386,7 +386,11 @@ $$;
 RESET client_min_messages;
 
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
-CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+SET icu_locale_validation = true;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense'); -- fails
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+RESET icu_locale_validation;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense'); DROP COLLATION testx;
 
 CREATE COLLATION test4 FROM nonsense;
 CREATE COLLATION test5 FROM test0;
-- 
2.34.1

Re: ICU locale validation / canonicalization

Reply via email to