POSIX gettext() and the installation directories for .mo files

2022-01-16 Thread Bruno Haible via austin-group-l at The Open Group
[First sent on 2021-05-03. Resending because it has not been fully handled.]

https://posix.rhansen.org/p/gettext_draft
says (line 343..345)

  "For each locale name in LANGUAGE, or if LANGUAGE is not set or is
   empty, or no suitable messages object is found in processing LANGUAGE,
   the pathname used to locate the messages object shall be
   dirname/localename/categoryname/textdomainname.mo, where:
   ...
   For the LANGUAGE search, the localename part is each locale name from
   LANGUAGE in turn.  For the single-locale search, the localename part
   is the name of the current locale, or the locale specified in an *_l()
   function call, for the category named by categoryname."

This text is ambiguous. The first cited paragraph says that it looks in a
single directory; the second cited paragraph says that it tries locale names
"in turn". This is contradictory. Also, when it says "in turn", it does not
say what the stopping condition is: does it loop
  - until an existing locale name is found?
  - until a file dirname/localename/categoryname/textdomainname.mo is found?
  - until a file dirname/localename/categoryname/textdomainname.mo is found
    that contains a translation for the given msgid?

For most of the interpretations of this set of paragraphs, this is NOT how
GNU gettext behaves. If POSIX standardizes it like this, GNU libc and
GNU gettext will have the choice among
  (a) looking in different (and fewer) directories than they do today,
  causing major i18n dysfunctionality to users, until the users
  have set up lots of symbolic links between directories, or
  (b) violating POSIX in this point.

I will vote for (b).

Namely, what GNU gettext does is to look in SEVERAL (not ONE) directories
per LANGUAGE element. This is true for *both* the LANGUAGE search and the
single-locale search.

The localename parts of these directories are constructed from the language
identifier (element of LANGUAGE) or locale name. For example:

* The language identifier 'de' gives rise to the localename part
de

* The language identifier 'de_AT' gives rise to the localename parts
de_AT
de

* The locale name 'de_AT.UTF-8' gives rise to the localename parts
de_AT.UTF-8
de_AT.utf8
de_AT
de.UTF-8
de.utf8
de

* The locale name 'uz_UZ.UTF-8@cyrillic' gives rise to the localename parts
uz_UZ.UTF-8@cyrillic
uz_UZ.utf8@cyrillic
uz_UZ@cyrillic
uz.UTF-8@cyrillic
uz.utf8@cyrillic
uz@cyrillic
uz_UZ.UTF-8
uz_UZ.utf8
uz_UZ
uz.UTF-8
uz.utf8
uz
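
The expansion above can be sketched in a few lines of Python. This is
only an illustration that reproduces the four lists shown, not the GNU
gettext implementation: the function name localename_candidates is
hypothetical, and the codeset normalization used here (lowercase, drop
non-alphanumeric characters) is an assumption that happens to map
'UTF-8' to 'utf8'.

```python
import re

def localename_candidates(name):
    """Return localename parts to try for one language identifier or
    locale name, most specific first (illustrative sketch only)."""
    m = re.fullmatch(
        r"([^_.@]+)(?:_([^.@]+))?(?:\.([^@]+))?(?:@(.+))?", name)
    if m is None:
        return [name]
    language, territory, codeset, modifier = m.groups()

    # Codeset variants: as written, normalized (assumed rule), then none.
    codesets = []
    if codeset:
        codesets.append(codeset)
        normalized = re.sub(r"[^0-9A-Za-z]", "", codeset).lower()
        if normalized != codeset:
            codesets.append(normalized)
    codesets.append(None)

    result = []
    # Drop the modifier last, the territory next, the codeset first.
    for mod in ([modifier, None] if modifier else [None]):
        for terr in ([territory, None] if territory else [None]):
            for cs in codesets:
                cand = language
                if terr:
                    cand += "_" + terr
                if cs:
                    cand += "." + cs
                if mod:
                    cand += "@" + mod
                if cand not in result:
                    result.append(cand)
    return result

for part in localename_candidates("uz_UZ.UTF-8@cyrillic"):
    print(part)
```

Called with 'de', 'de_AT', 'de_AT.UTF-8' and 'uz_UZ.UTF-8@cyrillic', the
sketch yields the same lists, in the same order, as the examples above.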

This list of directories is important for people who live in communities
which often (but not always) have translations of their own but can read
translations for other locales. In the examples above:

  * A user in Austria prefers translations for Austrian German, but can
also read German with no problem.

  * A user in Uzbekistan may prefer translations in Cyrillic but can also
read translations in Latin. [1]

If the above text were adopted, it would have the following consequences:

  1) Many symbolic links are needed in /usr/share/locale/. Solaris 11.4
 is a system that implements gettext() as described in above text,
 and it has the links shown below [2].

  2) Users who want to create a new locale (e.g. for English in Australia)
 will have to create a symlink
 /usr/share/locale/en_AU -> /usr/share/locale/en
 and so on for each custom locale.

  3) Users who install packages in non-privileged directories (for GNU
 programs, that's the --prefix=PREFIX option) will have to create the
 same amount of symbolic links in their PREFIX/share/locale/ directory.

  4) Users will have to set fallback logic in their LANGUAGE environment
 variable

   LANGUAGE=de_AT:de_DE

 instead of having it built-in:

   LANGUAGE=de_AT
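
To illustrate point 4, here is a minimal Python sketch of the pathnames
that would be probed under one literal reading of the cited text, with
exactly one directory per LANGUAGE element. The function name
strict_pathnames and the text domain 'hello' are hypothetical; the
directory /usr/share/locale and the category LC_MESSAGES are assumed
for the example.

```python
def strict_pathnames(language, dirname="/usr/share/locale",
                     categoryname="LC_MESSAGES", textdomainname="hello"):
    """One pathname per LANGUAGE element, with no built-in fallback
    (a literal reading of the cited text, for illustration only)."""
    return [
        f"{dirname}/{localename}/{categoryname}/{textdomainname}.mo"
        for localename in language.split(":")
        if localename  # skip empty elements
    ]

for value in ("de_AT", "de_AT:de_DE"):
    print(f"LANGUAGE={value}")
    for path in strict_pathnames(value):
        print("  " + path)
```

Under this reading neither setting ever probes the plain 'de'
directory, so a catalog installed only there is found only through a
symbolic link such as de_DE -> de.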

This is BAD, BAD, BAD.

Bruno

[1] https://en.wikipedia.org/wiki/Uzbek_alphabet
[2]
$ ls -l /usr/share/locale
total 102
drwxr-xr-x   3 root other  3 Oct 13  2018 C
drwxr-xr-x   3 root other  4 Oct 13  2018 de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE.ISO8859-1 -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE.ISO8859-15 -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE.UTF-8 -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de.ISO8859-15 -> de
drwxr-xr-x   3 root other  3 Oct 13  2018 de.us-ascii
lrwxrwxrwx   1 root root   2 Oct 13  2018 de.UTF-8 -> de
drwxr-xr-x   3 root other  3 Oct 13  2018 en
drwxr-xr-x   3 root other  3 Oct 13  2018 en_US
drwxr-xr-x   3 root other  3 Oct 13  2018 en@boldquot
drwxr-xr-x   3 root other  3 Oct 13  2018 en@quot
drwxr-xr-x   3 root other  3 Oct 13  2018 en@shaw
drwxr-xr-x   3 root other  4 Oct 13  2018 es
drwxr-xr-x   3 root other  3 Oct 13  2018 es_ES

POSIX gettext() and the installation directories for .mo files

2021-05-03 Thread Bruno Haible via austin-group-l at The Open Group
https://posix.rhansen.org/p/gettext_split
says (line 77..79)

  "For each locale name in LANGUAGE, or if LANGUAGE is not set or is
   empty, or no suitable messages object is found in processing LANGUAGE,
   the pathname used to locate the messages object shall be
   dirname/localename/categoryname/textdomainname.mo, where:
   ...
   For the LANGUAGE search, the localename part is each locale name from
   LANGUAGE in turn.  For the single-locale search, the localename part
   is the name of the current locale, or the locale specified in an *_l()
   function call, for the category named by categoryname."

This is NOT how GNU gettext behaves. If POSIX standardizes it like this,
GNU libc and GNU gettext will have the choice among
  (a) looking in different (and fewer) directories than they do today,
  causing major i18n dysfunctionality to users, until the users
  have set up lots of symbolic links between directories, or
  (b) violating POSIX in this point.

I will vote for (b).

Namely, what GNU gettext does is to look in SEVERAL (not ONE) directories
per LANGUAGE element.

The localename parts of these directories are constructed from the language
identifier (element of LANGUAGE) or locale name. For example:

* The language identifier 'de' gives rise to the localename part
de

* The language identifier 'de_AT' gives rise to the localename parts
de_AT
de

* The locale name 'de_AT.UTF-8' gives rise to the localename parts
de_AT.UTF-8
de_AT.utf8
de_AT
de.UTF-8
de.utf8
de

* The locale name 'uz_UZ.UTF-8@cyrillic' gives rise to the localename parts
uz_UZ.UTF-8@cyrillic
uz_UZ.utf8@cyrillic
uz_UZ@cyrillic
uz.UTF-8@cyrillic
uz.utf8@cyrillic
uz@cyrillic
uz_UZ.UTF-8
uz_UZ.utf8
uz_UZ
uz.UTF-8
uz.utf8
uz

This list of directories is important for people who live in communities
which often (but not always) have translations of their own but can read
translations for other locales. In the examples above:

  * A user in Austria prefers translations for Austrian German, but can
also read German with no problem.

  * A user in Uzbekistan may prefer translations in Cyrillic but can also
read translations in Latin. [1]

If the above text were adopted, it would have the following consequences:

  1) Many symbolic links are needed in /usr/share/locale/. Solaris 11.4
 is a system that implements gettext() as described in above text,
 and it has the links shown below [2].

  2) Users who want to create a new locale (e.g. for English in Australia)
 will have to create a symlink
 /usr/share/locale/en_AU -> /usr/share/locale/en
 and so on for each custom locale.

  3) Users who install packages in non-privileged directories (for GNU
 programs, that's the --prefix=PREFIX option) will have to create the
 same amount of symbolic links in their PREFIX/share/locale/ directory.

  4) Users will have to set fallback logic in their LANGUAGE environment
 variable

   LANGUAGE=de_AT:de_DE

 instead of having it built-in:

   LANGUAGE=de_AT

This is BAD, BAD, BAD.

Bruno

[1] https://en.wikipedia.org/wiki/Uzbek_alphabet
[2]
$ ls -l /usr/share/locale
total 102
drwxr-xr-x   3 root other  3 Oct 13  2018 C
drwxr-xr-x   3 root other  4 Oct 13  2018 de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE.ISO8859-1 -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE.ISO8859-15 -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE.UTF-8 -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de.ISO8859-15 -> de
drwxr-xr-x   3 root other  3 Oct 13  2018 de.us-ascii
lrwxrwxrwx   1 root root   2 Oct 13  2018 de.UTF-8 -> de
drwxr-xr-x   3 root other  3 Oct 13  2018 en
drwxr-xr-x   3 root other  3 Oct 13  2018 en_US
drwxr-xr-x   3 root other  3 Oct 13  2018 en@boldquot
drwxr-xr-x   3 root other  3 Oct 13  2018 en@quot
drwxr-xr-x   3 root other  3 Oct 13  2018 en@shaw
drwxr-xr-x   3 root other  4 Oct 13  2018 es
drwxr-xr-x   3 root other  3 Oct 13  2018 es_ES
lrwxrwxrwx   1 root root   2 Oct 13  2018 es_ES.ISO8859-1 -> es
lrwxrwxrwx   1 root root   2 Oct 13  2018 es_ES.ISO8859-15 -> es
lrwxrwxrwx   1 root root   2 Oct 13  2018 es_ES.UTF-8 -> es
lrwxrwxrwx   1 root root   2 Oct 13  2018 es.ISO8859-15 -> es
lrwxrwxrwx   1 root root   2 Oct 13  2018 es.UTF-8 -> es
drwxr-xr-x   3 root other  4 Oct 13  2018 fr
lrwxrwxrwx   1 root root   2 Oct 13  2018 fr_FR -> fr
lrwxrwxrwx   1 root root   2 Oct 13  2018 fr_FR.ISO8859-1 -> fr
lrwxrwxrwx   1 root root   2 Oct 13  2018 fr_FR.ISO8859-15 -> fr
lrwxrwxrwx   1 root root   2 Oct 13  2018 fr_FR.UTF-8 -> fr
lrwxrwxrwx   1 root root