Hello,

In several systems, you can ask for the version of a locale's
definition.  This can be important for persistent data structures that
live long enough for the locale definition to change, but depend on
its stability.  Concretely, I mean things like on-disk btree indexes
in databases that are ordered using strcoll_l(), when natural language
ordering is desired.

http://www.unicode.org/reports/tr10/ says: "Over time, collation order
will vary: there may be fixes needed as more information becomes
available about languages; there may be new government or industry
standards for the language that require changes; and finally, new
characters added to the Unicode Standard will interleave with the
previously-defined ones. This means that collations must be carefully
versioned."

For this reason, many database systems avoid using operating system
locale support, but that has other downsides including disagreeing
with other software on the same system.  While a system locale
implementation could conceivably offer a way to open different
versions of a locale explicitly or through the file system or
environment, I think it would be good at a minimum for an application
to have a standard way to know if the definition of an existing locale
opened by name has changed.  Several locale APIs expose this
information already:

https://man.freebsd.org/cgi/man.cgi?query=querylocale&sektion=3&format=html
https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getnlsversionex
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html

For example, PostgreSQL (which can use POSIX, Windows or ICU locales)
checks these values to see if they have changed unexpectedly, and
complains that affected indexes must be rebuilt if so.  This usually
happens after an operating system upgrade or migration to a different
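computer.  Failing to rebuild can result in data corruption: an index
sorted under the old definition no longer matches the order that
strcoll_l() now reports, so btree traversals take wrong turns.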

In a hypothetical standard API, the values returned could be left
unspecified, only to be used to compare for equality with an earlier
stored value.  That is the case with the above-mentioned systems.  In
practice, the values combine elements such as the CLDR version and the
Unicode version.
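
To illustrate the compare-for-equality usage, here is a minimal sketch
in C.  The names version_check and check_locale_version() are made up
for this example and don't come from any of the systems above.  The
version strings are treated as opaque, and treating a missing string
as "unknown" is my assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    VERSION_UNKNOWN,    /* no version available on one side */
    VERSION_SAME,       /* strings are byte-for-byte equal */
    VERSION_CHANGED     /* strings differ: dependent indexes are suspect */
} version_check;

/*
 * Compare the version string recorded when an index was built with the
 * one the locale reports now.  The contents are never parsed or ordered,
 * only tested for equality.
 */
static version_check
check_locale_version(const char *stored, const char *current)
{
    if (stored == NULL || current == NULL)
        return VERSION_UNKNOWN;
    return strcmp(stored, current) == 0 ? VERSION_SAME : VERSION_CHANGED;
}
```

An application would store the string next to the index when building
it, and warn or rebuild when a later check returns VERSION_CHANGED.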

Getting the information out:

POSIX 2024 has standardised getlocalename_l(), which is approximately
the same as querylocale() (found on macOS and the BSDs), rather than
querylocale() itself.  Proposing FreeBSD's LC_VERSION_MASK extension
to querylocale() therefore wouldn't make much sense.  Other ideas
include:

1.  nl_langinfo_l(LC_LOCALE_VERSION(category), loc)

Inspired by glibc's non-standard
nl_langinfo_l(LC_LOCALE_NAME(category), loc), which is like
getlocalename_l(category, loc).  Hammering a category into an nl_item
parameter with a function-like macro is perhaps a little unusual.

2.  getlocaleversion_l(category, loc)

Inspired by standard getlocalename_l(category, loc).  Takes a category
explicitly, which seems a little more natural to me, but it also
creates a new function name.

Getting the information in:

The localedef locale definition source format could potentially define
syntax for providing the version string, but I haven't studied this
part yet.  (FreeBSD's localedef currently has a -V switch to provide a
version string, and the locales in the base system are compiled with
that set to the Unicode CLDR version of the source data for
LC_COLLATE.)

I'd love to hear any feedback on the general idea, or relevant systems
I may be missing, before trying to propose something more concrete.

Thanks for reading,

Thomas Munro
