On Fri, 2025-10-10 at 17:48 -0700, Jeff Davis wrote:
> -------
> Summary
> -------
>
> The libc collation provider is a bad default[1]. The builtin collation
> provider is a good default, so let's use that.

The attached patches implement a more modest proposal which does not
conflict with Peter's objection about the display order:

0001: If the encoding is unspecified, and cannot be determined from the
locale (i.e. the locale is C), then use UTF-8 rather than SQL_ASCII.

0002: If the provider is unspecified, and the locale is C or C.UTF-8,
then use the builtin provider.

Motivation:

* UTF-8 seems safer than SQL_ASCII when the locale is compatible with
  either.

* Whether the "C" locale uses the builtin provider or the libc provider
  is mostly about the catalog representation, because the implementation
  is the same. I don't have a strong motivation for this change; it just
  clarifies that libc is not actually being used when the locale is "C".

* I think most users of the "C.UTF-8" locale would be better off with
  the builtin provider, which benefits from important optimizations.

Note:

This would mean that "initdb --no-locale" would select UTF-8 and the
builtin provider with locale "C", whereas previously it would have
selected SQL_ASCII and the libc provider (though it never actually
used libc internally). I'm not sure whether others want this behavior
or whether it would be surprising.
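
To make the combined effect of 0001 and 0002 easier to reason about,
here is a minimal standalone sketch of the selection logic. It is not
the patch code: the enum and struct types and the choose_defaults()
name are hypothetical stand-ins for initdb's PG_SQL_ASCII / PG_UTF8
and COLLPROVIDER_* values, and it assumes no encoding was given on the
command line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for initdb's encoding IDs and provider codes. */
typedef enum { ENC_SQL_ASCII, ENC_UTF8, ENC_OTHER } SketchEncoding;
typedef enum { PROV_LIBC, PROV_BUILTIN, PROV_ICU } SketchProvider;

typedef struct
{
	SketchEncoding encoding;
	SketchProvider provider;
	const char *datlocale;		/* NULL: no builtin locale was chosen */
} SketchDefaults;

/*
 * Sketch of the proposed defaults. ctype_enc is the encoding derived from
 * LC_CTYPE; the caller is assumed not to have specified an encoding.
 */
static SketchDefaults
choose_defaults(const char *lc_collate, const char *lc_ctype,
				SketchEncoding ctype_enc,
				bool provider_specified, SketchProvider provider)
{
	SketchDefaults d = {ctype_enc, provider, NULL};

	assert(lc_collate != NULL && lc_ctype != NULL);

	/* 0001: SQL_ASCII is compatible with any encoding, so prefer UTF-8. */
	if (d.encoding == ENC_SQL_ASCII)
		d.encoding = ENC_UTF8;

	/*
	 * 0002: only override the default libc provider, and only when
	 * LC_COLLATE and LC_CTYPE agree on C or C.UTF-8.
	 */
	if (!provider_specified && provider == PROV_LIBC &&
		strcmp(lc_ctype, lc_collate) == 0)
	{
		if (strcmp(lc_ctype, "C") == 0)
		{
			d.provider = PROV_BUILTIN;
			d.datlocale = "C";
		}
		else if (strcmp(lc_ctype, "C.UTF-8") == 0 ||
				 strcmp(lc_ctype, "C.UTF8") == 0)
		{
			d.provider = PROV_BUILTIN;
			d.datlocale = "C.UTF-8";
		}
	}

	return d;
}
```

In this sketch, the "initdb --no-locale" case corresponds to
choose_defaults("C", "C", ENC_SQL_ASCII, false, PROV_LIBC), which
yields UTF-8 and the builtin provider with locale "C". Passing
provider_specified = true leaves the provider alone, which is the
"do not override" half of 0002.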
Regards,
Jeff Davis
From 9c8cf58c541462a6aef43fed0ddea1e9f1633960 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Fri, 31 Oct 2025 13:36:46 -0700
Subject: [PATCH v1 1/2] initdb: prefer UTF-8 encoding over SQL_ASCII.
This was already true for the ICU locale provider; make it true for
the others.
---
src/bin/initdb/initdb.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 92fe2f531f7..aa7fc5a6636 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2718,10 +2718,10 @@ setup_locale_encoding(void)
ctype_enc = pg_get_encoding_from_locale(lc_ctype, true);
/*
- * If ctype_enc=SQL_ASCII, it's compatible with any encoding. ICU does
- * not support SQL_ASCII, so select UTF-8 instead.
+ * If ctype_enc=SQL_ASCII, it's compatible with any encoding. Prefer
+ * UTF-8.
*/
- if (locale_provider == COLLPROVIDER_ICU && ctype_enc == PG_SQL_ASCII)
+ if (ctype_enc == PG_SQL_ASCII)
ctype_enc = PG_UTF8;
if (ctype_enc == -1)
--
2.43.0
From 8b1659fab50396eaeacab042aeaef8df241af467 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Fri, 31 Oct 2025 14:05:10 -0700
Subject: [PATCH v1 2/2] initdb: if locale is C or C.UTF-8, use builtin
provider.
If the provider is unspecified and the locale is C or C.UTF-8, use the
builtin provider. If the provider is specified, then do not override it.
The C locale has always been, effectively, the builtin provider, in
the sense that it uses built-in logic rather than strcoll(), etc. The
change here is mostly about the catalog representation.
The C.UTF-8 locale has used libc, but by doing so, collation doesn't
benefit from important performance optimizations. Now that we have a
builtin "C.UTF-8" collation which does benefit from those
optimizations, use that.
---
src/bin/initdb/initdb.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index aa7fc5a6636..84931f145f4 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -145,6 +145,7 @@ static char *lc_numeric = NULL;
static char *lc_time = NULL;
static char *lc_messages = NULL;
static char locale_provider = COLLPROVIDER_LIBC;
+static bool locale_provider_specified = false;
static bool builtin_locale_specified = false;
static char *datlocale = NULL;
static bool icu_locale_specified = false;
@@ -2465,6 +2466,28 @@ setlocales(void)
lc_messages = canonname;
#endif
+ /*
+ * If the locale is C or C.UTF-8, and no provider was specified, use the
+ * builtin provider rather than libc.
+ */
+ if (!locale_provider_specified && locale_provider == COLLPROVIDER_LIBC)
+ {
+ if (strcmp(lc_ctype, lc_collate) == 0)
+ {
+ if (strcmp(lc_ctype, "C") == 0)
+ {
+ locale_provider = COLLPROVIDER_BUILTIN;
+ datlocale = "C";
+ }
+ else if (strcmp(lc_ctype, "C.UTF-8") == 0 ||
+ strcmp(lc_ctype, "C.UTF8") == 0)
+ {
+ locale_provider = COLLPROVIDER_BUILTIN;
+ datlocale = "C.UTF-8";
+ }
+ }
+ }
+
if (locale_provider != COLLPROVIDER_LIBC && datlocale == NULL)
pg_fatal("locale must be specified if provider is %s",
collprovider_name(locale_provider));
@@ -3362,6 +3385,8 @@ main(int argc, char *argv[])
"-c debug_discard_caches=1");
break;
case 15:
+ locale_provider_specified = true;
+
if (strcmp(optarg, "builtin") == 0)
locale_provider = COLLPROVIDER_BUILTIN;
else if (strcmp(optarg, "icu") == 0)
--
2.43.0