Another country has changed its name, and a Windows OS update has
again broken every PostgreSQL cluster in that whole country[1] (or at
least those that had accepted initdb's default choice of locale,
probably most).  Let's get to the bottom of this, because otherwise it
is simply going to keep happening, causing administrative pain for a
lot of people.

Here is a rebase of the basic patch I proposed last time, and a
re-statement of what we know:

1.  initdb chooses a default locale using a technique that gives you
an unstable ("Czech Republic"->"Czechia", "Turkey"->"Türkiye"),
non-ASCII ("Norwegian (Bokmål)") string that we are warned we should
not store anywhere.  We store it, and then later it is not recognised.
Instead we should select an IETF BCP 47 locale name, based on stable
ISO country and language codes, like "en-US", "tr-TR" etc.  Here is
the patch to teach initdb to use that, unchanged from v3 except that I
tweaked the docs a bit.

2.  In Windows 10+ it is now also possible to put ".UTF-8" on the end
of locale names.  I couldn't figure out whether we should do that, and
what effect it has on ctypes -- apparently not the effect I expected
(see upthread).  Was our UTF-8 support on Windows already broken, and
this new ".UTF-8" thing is just a new way to reach that brokenness?
Is it OK to continue to choose the "legacy" single byte encodings by
default on that OS, and consider that a separate topic for separate
research?

3.  It is not clear to me how we should deal with pg_upgrade.
Eventually we want all of the old-school names to fade away, and
pg_upgrade would need to be part of that.  Perhaps there is some API
that can be used to translate to the new canonical forms without us
having to maintain translation tables and other messiness in our tree.

4.  Eventually we should probably ban non-ASCII characters from
entering the relevant catalogues (they are shared, so their encoding
is undefined except that they must be a superset of ASCII), and delete
all the old win32setlocale.c kludges, after we reach a point where
everyone should be using exclusively BCP 47.

[1] 
https://www.postgresql.org/message-id/flat/18196-b10f93dfbde3d7db%40postgresql.org
From d015005cca08bc1c7ae487392ed7b5a4cfa58748 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 19 Jul 2022 06:31:17 +1200
Subject: [PATCH v4] Default to IETF BCP 47 locale names in initdb on Windows.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Avoid selecting traditional Windows locale names written with English
words, because (1) they are unstable and not recommended for use in
databases and (2) they may contain non-ASCII characters, which we can't
put in our shared catalogs.  Since setlocale() returns such names, on
Windows use GetUserDefaultLocaleName() if the user didn't provide an
explicit locale.  It returns BCP 47 strings like "en-US".

Also update the documentation to recommend BCP 47 over the traditional
names when providing explicit values to initdb.

Reviewed-by: Juan José Santamaría Flecha <juanjo.santamaria@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
---
 doc/src/sgml/charset.sgml | 13 +++++++++++--
 src/bin/initdb/initdb.c   | 31 +++++++++++++++++++++++++++++--
 2 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 74783d148f..9a2cd5c2d5 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -83,8 +83,17 @@ initdb --locale=sv_SE
     system under what names depends on what was provided by the operating
     system vendor and what was installed.  On most Unix systems, the command
     <literal>locale -a</literal> will provide a list of available locales.
-    Windows uses more verbose locale names, such as <literal>German_Germany</literal>
-    or <literal>Swedish_Sweden.1252</literal>, but the principles are the same.
+   </para>
+
+   <para>
+    Windows uses BCP 47 language tags, like ICU.
+    For example, <literal>sv-SE</literal> represents Swedish as spoken in Sweden.
+    Windows also supports more verbose locale names based on full names
+    such as <literal>German_Germany</literal> or <literal>Swedish_Sweden.1252</literal>,
+    but these are not recommended because they are not stable across operating
+    system updates due to changes in geographical names, and may contain
+    non-ASCII characters which are not supported in PostgreSQL's shared
+    catalogs.
    </para>
 
    <para>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..021e847240 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -64,6 +64,10 @@
 #include "sys/mman.h"
 #endif
 
+#ifdef WIN32
+#include <winnls.h>
+#endif
+
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h" /* pgrminclude ignore */
@@ -2132,6 +2136,7 @@ locale_date_order(const char *locale)
 static void
 check_locale_name(int category, const char *locale, char **canonname)
 {
+	char	   *locale_copy;
 	char	   *save;
 	char	   *res;
 
@@ -2147,10 +2152,30 @@ check_locale_name(int category, const char *locale, char **canonname)
 
 	/* for setlocale() call */
 	if (!locale)
-		locale = "";
+	{
+#ifdef WIN32
+		wchar_t		wide_name[LOCALE_NAME_MAX_LENGTH];
+		char		name[LOCALE_NAME_MAX_LENGTH];
+
+		/* use Windows API to find the default in BCP47 format */
+		if (GetUserDefaultLocaleName(wide_name, LOCALE_NAME_MAX_LENGTH) == 0)
+			pg_fatal("failed to get default locale name: error code %lu",
+					 GetLastError());
+		if (WideCharToMultiByte(CP_ACP, 0, wide_name, -1, name,
+								LOCALE_NAME_MAX_LENGTH, NULL, NULL) == 0)
+			pg_fatal("failed to convert locale name: error code %lu",
+					 GetLastError());
+		locale_copy = pg_strdup(name);
+#else
+		/* use environment to find the default */
+		locale_copy = pg_strdup("");
+#endif
+	}
+	else
+		locale_copy = pg_strdup(locale);
 
 	/* set the locale with setlocale, to see if it accepts it. */
-	res = setlocale(category, locale);
+	res = setlocale(category, locale_copy);
 
 	/* save canonical name if requested. */
 	if (res && canonname)
@@ -2183,6 +2208,8 @@ check_locale_name(int category, const char *locale, char **canonname)
 			pg_fatal("invalid locale settings; check LANG and LC_* environment variables");
 		}
 	}
+
+	free(locale_copy);
 }
 
 /*
-- 
2.39.2

Reply via email to