On Wed, 2024-01-10 at 23:56 +0100, Daniel Verite wrote:
> $ bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata
>
> The database cluster will be initialized with this locale
> configuration:
>   default collation provider:  builtin
>   default collation locale:    C.UTF-8
>   LC_COLLATE:  C.UTF-8
>   LC_CTYPE:    C.UTF-8
>   LC_MESSAGES: C.UTF-8
>   LC_MONETARY: C.UTF-8
>   LC_NUMERIC:  C.UTF-8
>   LC_TIME:     C.UTF-8
> The default database encoding has accordingly been set to "UTF8".
> The default text search configuration will be set to "english".
>
> This is from an environment where LANG=fr_FR.UTF-8
>
> I would expect all LC_* variables to be fr_FR.UTF-8, and the default
> text search configuration to be "french".

You can get the behavior you want by doing:

  initdb --builtin-locale=C.UTF-8 --locale-provider=builtin \
    -D/tmp/pgdata

where "--builtin-locale" is analogous to "--icu-locale".

It looks like I forgot to document the new initdb option, which seems
to be the source of the confusion. Sorry, I'll fix that in the next
patch set. (See examples in the initdb tests.)

I think this answers some of your follow-up questions as well.

> A related comment is about naming the builtin locale C.UTF-8, the
> same name as in libc. On one hand this is semantically sound, but on
> the other hand, it's likely to confuse people. What about using
> completely different names, like "pg_unicode" or something else
> prefixed by "pg_" both for the locale name and the collation name
> (currently C.UTF-8/c_utf8)?

I'm flexible on naming, but here are my thoughts:

* A "pg_" prefix makes sense.

* If we named it something like "pg_unicode" someone might expect it
  to sort using the root collation.

* The locale name "C.UTF-8" is nice because it implies things about
  both the collation and the character behavior. It's also nice
  because on at least some platforms, the behavior is almost identical
  to the libc locale of the same name.

* UCS_BASIC might be a good name, because it also seems to carry the
  right meanings, but that name is already taken.

* We also might want to support variations, such as full case mapping
  (which uppercases "ß" to "SS", as the SQL standard requires), or
  perhaps the "standard" flavor of regexes (which don't count all
  symbols as punctuation). Leaving some room to name those variations
  would be a good idea.

Regards,
	Jeff Davis