Re: Question regarding gettext behavior on iconv failure

2021-05-03 Thread Carlos O'Donell via austin-group-l at The Open Group
On Mon, May 3, 2021 at 5:38 PM Bruno Haible via austin-group-l at The
Open Group  wrote:
> 1 Empfaenger Chinese (??,???,??)  ??
>   * For the second line of output, in the first three cases, iconv()
> did transliteration, and the result was always an ASCII string.
> (The quality of glibc's transliteration of Hanzi characters to
> question marks can be debated, though.)

Completely off-topic, but is there a "high quality" transliteration of
Hanzi characters?
Would you have expected a phoneme to be spelled out in ASCII?
I am not aware of any way to keep the meaning of the Hanzi characters in
ASCII, therefore you see the locale's "default_missing" character U+003F '?'.

Cheers,
Carlos.



POSIX gettext() and the locale category

2021-05-03 Thread Bruno Haible via austin-group-l at The Open Group
https://posix.rhansen.org/p/gettext_split
says (line 217):

  "All of the functions in the gettext family of functions, except
   dcgettext(), search for messages objects only in the LC_MESSAGES
   category."

dcgettext_l, dcngettext, dcngettext_l also search in the specified
category.

Suggested wording:

  "All of the functions in the gettext family of functions, except
   for dcgettext(), dcgettext_l(), dcngettext(), and dcngettext_l()
   search for messages objects only in the LC_MESSAGES category."

Bruno



POSIX gettext() and chdir()

2021-05-03 Thread Bruno Haible via austin-group-l at The Open Group
https://posix.rhansen.org/p/gettext_split
says (line 273):

  "The bindtextdomain() function shall not perform pathname resolution
   on dirname (that is done by the gettext family of functions)."

This is indeed how GNU gettext and GNU libc behave. However, this is
not optimal:

  1) If the dirname is not absolute, the application cannot use chdir()
     at various points and still expect gettext() to work.

  2) Storing a dirname over a long time and using it only later during
     the gettext() call opens the door to file system races.

A more modern approach would be to

  * have bindtextdomain() open() the given directory and store the
    resulting file descriptor,
  * have gettext() use openat() instead of open() for locating and opening
    the message catalogs.

Such changes have not been implemented in GNU gettext and glibc so far.
But it would be good to not preclude such an improved implementation.
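
A minimal sketch of the descriptor-based approach described above, assuming a POSIX.1-2008 system. The function names bind_dir_fd() and open_catalog_at() are hypothetical and exist only for this illustration; neither GNU gettext nor glibc provides them.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical bind-time step: resolve dirname once and keep a
   descriptor for the directory instead of the pathname string. */
int bind_dir_fd(const char *dirname)
{
  assert(dirname != NULL);
  return open(dirname, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/* Hypothetical lookup-time step: open
   localename/categoryname/domainname.mo relative to the stored
   descriptor, so a later chdir() does not change what is found.
   Returns a file descriptor, or -1 on failure. */
int open_catalog_at(int dirfd, const char *localename,
                    const char *categoryname, const char *domainname)
{
  char rel[512];
  int len = snprintf(rel, sizeof rel, "%s/%s/%s.mo",
                     localename, categoryname, domainname);
  if (len < 0 || (size_t)len >= sizeof rel)
    return -1;
  return openat(dirfd, rel, O_RDONLY | O_CLOEXEC);
}
```

With this shape, a relative dirname such as "." is resolved exactly once, at bind time, and every later lookup goes through the same directory descriptor.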

Suggested wording:

  "It is unspecified whether the bindtextdomain() function performs
   pathname resolution on dirname, or whether that is done by the
   gettext family of functions."

Bruno



POSIX gettext() and uselocale()

2021-05-03 Thread Bruno Haible via austin-group-l at The Open Group
https://posix.rhansen.org/p/gettext_split
says (line 92):

  "The returned string may be invalidated by a subsequent call to
   bind_textdomain_codeset(), bindtextdomain(), setlocale(),
   textdomain(), or uselocale()."

While in most programs setlocale(), textdomain(), bindtextdomain(),
bind_textdomain_codeset() are being called at the beginning of the
program execution, before any call to gettext(), the situation is
very different for uselocale().

1) uselocale() is meant to have effects ONLY on the thread in which it
   is called.

2) uselocale() is a helper function to implement *_l functions where
   the POSIX standard does not specify them or the system does not have
   them.
   For example, when a program wants to have a function to parse
   a number, recognizing only the ASCII digits and only '.' as decimal
   separator, a reliable way to implement such a function is by calling
   uselocale() with the "C" locale, then strtod(), and then uselocale()
   again to switch the thread back to the previous locale.

   If POSIX did not have uselocale(), it would need to provide many
   more *_l functions.
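
The number-parsing pattern from point 2 can be sketched as follows. The name strtod_c() is made up for this illustration; newlocale(), uselocale() and freelocale() are the POSIX.1-2008 functions. On some systems (e.g. macOS) they are declared in <xlocale.h> rather than <locale.h>.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Parse a number with "C" locale rules, affecting only the calling
   thread. On failure to create the locale object, returns 0.0 and
   reports that nothing was consumed. */
double strtod_c(const char *s, char **endptr)
{
  assert(s != NULL);
  locale_t c_loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
  if (c_loc == (locale_t)0) {
    if (endptr != NULL)
      *endptr = (char *)s;
    return 0.0;
  }
  locale_t prev = uselocale(c_loc);   /* switch this thread to "C" */
  double value = strtod(s, endptr);
  uselocale(prev);                    /* switch back before freeing */
  freelocale(c_loc);
  return value;
}
```

A library that does this internally calls uselocale() twice per parse, which is exactly the kind of call that the quoted wording would allow to invalidate a gettext() result.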

If the gettext() result may be invalidated by a uselocale() call (in
any other thread!), this would mean that

  ** Programs can use gettext() or uselocale() but not both. **

and - more or less -

  ** Multithreaded programs that use libraries (that may use uselocale())
 cannot use gettext(). **

I think that specifying gettext() to be so restricted is not useful.
It would make more sense to allow concurrent uselocale() calls.

Proposed wording:

  "The returned string may be invalidated by a subsequent call to
   bind_textdomain_codeset(), bindtextdomain(), setlocale(),
   or textdomain()."

Bruno



POSIX gettext() and the installation directories for .mo files

2021-05-03 Thread Bruno Haible via austin-group-l at The Open Group
https://posix.rhansen.org/p/gettext_split
says (line 77..79)

  "For each locale name in LANGUAGE, or if LANGUAGE is not set or is
   empty, or no suitable messages object is found in processing LANGUAGE,
   the pathname used to locate the messages object shall be
   dirname/localename/categoryname/textdomainname.mo, where:
   ...
   For the LANGUAGE search, the localename part is each locale name from
   LANGUAGE in turn.  For the single-locale search, the localename part
   is the name of the current locale, or the locale specified in an *_l()
   function call, for the category named by categoryname."

This is NOT how GNU gettext behaves. If POSIX standardizes it like this,
GNU libc and GNU gettext will have the choice among
  (a) looking in different (and fewer) directories than they do today,
  causing major i18n dysfunctionality to users, until the users
  have set up lots of symbolic links between directories, or
  (b) violating POSIX in this point.

I will vote for (b).

Namely, what GNU gettext does is to look in SEVERAL (not ONE) directories
per LANGUAGE element.

The localename parts of these directories are constructed from the language
identifier (element of LANGUAGE) or locale name. For example:

* The language identifier 'de' gives rise to the localename part
de

* The language identifier 'de_AT' gives rise to the localename parts
de_AT
de

* The locale name 'de_AT.UTF-8' gives rise to the localename parts
de_AT.UTF-8
de_AT.utf8
de_AT
de.UTF-8
de.utf8
de

* The locale name 'uz_UZ.UTF-8@cyrillic' gives rise to the localename parts
uz_UZ.UTF-8@cyrillic
uz_UZ.utf8@cyrillic
uz_UZ@cyrillic
uz.UTF-8@cyrillic
uz.utf8@cyrillic
uz@cyrillic
uz_UZ.UTF-8
uz_UZ.utf8
uz_UZ
uz.UTF-8
uz.utf8
uz
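
The expansion listed above can be reproduced with a short sketch. It is written only to match the order shown in these examples and is not taken from the GNU gettext sources; the function name localename_candidates(), the fixed buffer sizes, and the codeset normalization rule (lowercase, alphanumerics only) are assumptions made for the illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_CANDIDATES 12
#define NAME_LEN 64

/* Fill out[] with the localename parts for one locale name or language
   identifier, most specific first. Returns the number of entries. */
size_t localename_candidates(const char *name,
                             char out[MAX_CANDIDATES][NAME_LEN])
{
  char buf[NAME_LEN], norm[NAME_LEN];
  assert(name != NULL);
  if (strlen(name) >= NAME_LEN)
    return 0;
  strcpy(buf, name);

  /* Split language[_territory][.codeset][@modifier] in place. */
  char *mod = strchr(buf, '@');
  if (mod != NULL) *mod++ = '\0';
  char *cs = strchr(buf, '.');
  if (cs != NULL) *cs++ = '\0';
  char *terr = strchr(buf, '_');
  if (terr != NULL) *terr++ = '\0';
  if (mod != NULL && *mod == '\0') mod = NULL;
  if (cs != NULL && *cs == '\0') cs = NULL;
  if (terr != NULL && *terr == '\0') terr = NULL;

  /* Normalized codeset, e.g. "UTF-8" becomes "utf8". */
  size_t j = 0;
  if (cs != NULL)
    for (const char *p = cs; *p != '\0'; p++)
      if (isalnum((unsigned char)*p))
        norm[j++] = (char)tolower((unsigned char)*p);
  norm[j] = '\0';

  /* Codeset variants to try: as given, normalized, then none. */
  const char *codesets[3] = {
    cs, (cs != NULL && norm[0] != '\0' && strcmp(cs, norm) != 0) ? norm : NULL, ""
  };

  size_t n = 0;
  for (int with_mod = (mod != NULL); with_mod >= 0; with_mod--)
    for (int with_terr = (terr != NULL); with_terr >= 0; with_terr--)
      for (int c = 0; c < 3; c++) {
        if (codesets[c] == NULL)
          continue;
        snprintf(out[n++], NAME_LEN, "%s%s%s%s%s%s%s", buf,
                 with_terr ? "_" : "", with_terr ? terr : "",
                 codesets[c][0] != '\0' ? "." : "", codesets[c],
                 with_mod ? "@" : "", with_mod ? mod : "");
      }
  return n;
}
```

For 'de_AT.UTF-8' this yields the six entries listed above, and for 'uz_UZ.UTF-8@cyrillic' the twelve entries in the same order.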

This list of directories is important for people who live in communities
which often (but not always) have translations of their own but can read
translations for other locales. In the examples above:

  * A user in Austria prefers translations for Austrian German, but can
also read German with no problem.

  * A user in Uzbekistan may prefer translations in Cyrillic but can also
read translations in Latin. [1]

If the above text were adopted, it would have the following consequences:

  1) Many symbolic links are needed in /usr/share/locale/. Solaris 11.4
     is a system that implements gettext() as described in the above text,
     and it has the links shown below [2].

  2) Users who want to create a new locale (e.g. for English in Australia)
     will have to create a symlink
       /usr/share/locale/en_AU -> /usr/share/locale/en
     and so on for each custom locale.

  3) Users who install packages in non-privileged directories (for GNU
     programs, that's the --prefix=PREFIX option) will have to create the
     same number of symbolic links in their PREFIX/share/locale/ directory.

  4) Users will have to set fallback logic in their LANGUAGE environment
     variable

       LANGUAGE=de_AT:de_DE

     instead of having it built-in:

       LANGUAGE=de_AT

This is BAD, BAD, BAD.

Bruno

[1] https://en.wikipedia.org/wiki/Uzbek_alphabet
[2]
$ ls -l /usr/share/locale
total 102
drwxr-xr-x   3 root other  3 Oct 13  2018 C
drwxr-xr-x   3 root other  4 Oct 13  2018 de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE.ISO8859-1 -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE.ISO8859-15 -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE.UTF-8 -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de.ISO8859-15 -> de
drwxr-xr-x   3 root other  3 Oct 13  2018 de.us-ascii
lrwxrwxrwx   1 root root   2 Oct 13  2018 de.UTF-8 -> de
drwxr-xr-x   3 root other  3 Oct 13  2018 en
drwxr-xr-x   3 root other  3 Oct 13  2018 en_US
drwxr-xr-x   3 root other  3 Oct 13  2018 en@boldquot
drwxr-xr-x   3 root other  3 Oct 13  2018 en@quot
drwxr-xr-x   3 root other  3 Oct 13  2018 en@shaw
drwxr-xr-x   3 root other  4 Oct 13  2018 es
drwxr-xr-x   3 root other  3 Oct 13  2018 es_ES
lrwxrwxrwx   1 root root   2 Oct 13  2018 es_ES.ISO8859-1 -> es
lrwxrwxrwx   1 root root   2 Oct 13  2018 es_ES.ISO8859-15 -> es
lrwxrwxrwx   1 root root   2 Oct 13  2018 es_ES.UTF-8 -> es
lrwxrwxrwx   1 root root   2 Oct 13  2018 es.ISO8859-15 -> es
lrwxrwxrwx   1 root root   2 Oct 13  2018 es.UTF-8 -> es
drwxr-xr-x   3 root other  4 Oct 13  2018 fr
lrwxrwxrwx   1 root root   2 Oct 13  2018 fr_FR -> fr
lrwxrwxrwx   1 root root   2 Oct 13  2018 fr_FR.ISO8859-1 -> fr
lrwxrwxrwx   1 root root   2 Oct 13  2018 fr_FR.ISO8859-15 -> fr
lrwxrwxrwx   1 root root   2 Oct 13  2018 fr_FR.UTF-8 -> fr
lrwxrwxrwx   1 root root   

POSIX gettext() and the LANGUAGE environment variable

2021-05-03 Thread Bruno Haible via austin-group-l at The Open Group
https://posix.rhansen.org/p/gettext_split
says (line 72)

  "For the LANGUAGE search, the value of the LANGUAGE environment
   variable shall be a list of one or more locale names separated
   by a colon (':') character."

This is NOT how GNU gettext behaves. If POSIX standardizes it like this,
GNU libc and GNU gettext will have the choice among
  (a) forcing users to specify their preferences in a user-unfriendly way, or
  (b) violating POSIX in this point.

I will vote for (b).

Namely, what gettext expects in the LANGUAGE environment variable is
documented in
https://www.gnu.org/software/gettext/manual/html_node/The-LANGUAGE-variable.html

In a modern glibc system, the locale names are essentially

  C
  POSIX
  en_US.UTF-8
  de_DE.UTF-8
  fr_FR.UTF-8
  pt_BR.UTF-8
  etc.

We do NOT want a user who wants to see messages in Arabic (1st preference)
or French (2nd preference) to have to set

  LANGUAGE=ar_EG.UTF-8:fr_FR.UTF-8

We want the user to be able to write simply

  LANGUAGE=ar:fr

Suggested wording change:

  "For the LANGUAGE search, the value of the LANGUAGE environment
   variable shall be a list of one or more language identifiers.
   A language identifier is a locale name with the '.codeset' part
   removed and optionally also the territory and/or the modifiers removed.
   In the simplest case, a language identifier consists of just an ISO 639-1
   code."
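
As an illustration of the colon-separated format, here is a small sketch that splits a LANGUAGE value into its elements. The function name split_language(), the fixed limits, and the choice to skip empty or over-long elements are assumptions for this example, not behavior taken from GNU gettext or from the proposed standard text.

```c
#include <assert.h>
#include <string.h>

#define MAX_LANGS 16
#define LANG_LEN 32

/* Split a LANGUAGE value such as "ar:fr" into its elements, in
   preference order. Returns the number of elements stored. */
size_t split_language(const char *value, char out[MAX_LANGS][LANG_LEN])
{
  assert(value != NULL);
  size_t n = 0;
  while (*value != '\0' && n < MAX_LANGS) {
    size_t len = strcspn(value, ":");   /* length up to next ':' or end */
    if (len > 0 && len < LANG_LEN) {
      memcpy(out[n], value, len);
      out[n][len] = '\0';
      n++;
    }
    value += len;
    if (*value == ':')
      value++;
  }
  return n;
}
```

Under the suggested wording, each element obtained this way is a language identifier such as "ar" or "de_AT" rather than a full locale name.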

Bruno



POSIX gettext() and iconv_open()

2021-05-03 Thread Bruno Haible via austin-group-l at The Open Group
https://posix.rhansen.org/p/gettext_split
says (line 85)

  "The conversion shall be performed as if by a call to iconv() using a
   conversion descriptor returned by iconv_open(,
   )."

This is NOT how GNU gettext behaves. If POSIX standardizes it like this,
GNU libc and GNU gettext will have the choice among
  (a) dropping the transliteration during charset conversion of the
  messages,
  (b) violating POSIX in this point.

I will vote for (b).

Namely, what GNU gettext does, is to allocate an iconv descriptor that
allows transliteration. For example, when converting from ISO-8859-1 or
UTF-8 to ASCII, it would transform

  "1 Empfänger"  to  "1 Empfaenger"   (glibc in German locale)
                 or  "1 Empf\"anger"  (GNU libiconv)

Suggested wording change:

  "The conversion shall be performed as if by a call to iconv() using a
   conversion descriptor that converts from the returned 
   to the , with an implementation-dependent conversion
   quality."
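
A sketch of a conversion that permits transliteration, assuming an iconv implementation that accepts the "//TRANSLIT" suffix on the target encoding name (a glibc and GNU libiconv extension, not part of POSIX). The helper name convert_translit() is hypothetical, and the code is not taken from the GNU gettext sources. The transliterated output is implementation-dependent, so no particular result is shown.

```c
#include <assert.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Convert the NUL-terminated string `in` from `fromcode` to `tocode`,
   allowing transliteration. Returns 0 on success, -1 on failure. */
int convert_translit(const char *tocode, const char *fromcode,
                     const char *in, char *out, size_t outsize)
{
  char target[64];
  assert(tocode != NULL && fromcode != NULL && in != NULL && out != NULL);
  if (outsize == 0)
    return -1;
  int len = snprintf(target, sizeof target, "%s//TRANSLIT", tocode);
  if (len < 0 || (size_t)len >= sizeof target)
    return -1;

  iconv_t cd = iconv_open(target, fromcode);
  if (cd == (iconv_t)-1)
    return -1;                       /* conversion not supported */

  char *inp = (char *)in;            /* iconv() does not modify the input */
  size_t inleft = strlen(in);
  char *outp = out;
  size_t outleft = outsize - 1;      /* keep room for the terminator */

  size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
  if (rc != (size_t)-1)
    rc = iconv(cd, NULL, NULL, &outp, &outleft);   /* flush shift state */
  iconv_close(cd);
  if (rc == (size_t)-1)
    return -1;
  *outp = '\0';
  return 0;
}
```

A caller would pass, for example, "ASCII" and "UTF-8" as the two codesets; whether "ä" then becomes "ae", "\"a", "?" or a failure depends on the iconv implementation and the locale.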

Bruno




Re: Question regarding gettext behavior on iconv failure

2021-05-03 Thread Bruno Haible via austin-group-l at The Open Group
Hi Eric,

> The example in question set up several .po files and a specific
> environment to test various pluralization/transcoding fallbacks, and
> concludes with a snippet where a string with an encoding error in
> ISO-8859-1 is output in spite of an iconv failure, rather than the
> string passed in to ngettext():
> 
> 
> n_recipients = 1;
> // The following outputs "1 Empfänger" encoded in UTF-8:
> printf("%s\n", ngettext("recipient", "recipients", n_recipients));
>
> bind_textdomain_codeset("mail", "ASCII");
>
> n_recipients = 1;
> // The following outputs "recipient" with the same encoding as the
> // "recipient" argument to ngettext (remember, the system is assumed
> // to not support conversion from ISO/IEC 8859-1 to ASCII):
> printf("%s\n", ngettext("recipient", "recipients", n_recipients));
> // On GNU gettext, "1 Empfänger" is output in ISO-8859-1 here (i.e.
> // no conversion is done). I think we already agreed on considering this
> // behavior a bug,

I cannot reproduce this. Find attached my (complete) test case.

GNU gettext uses iconv_open() with arguments that indicate that a
conversion that is not 1:1 (e.g. transliteration) is better than a failure.

The result thus depends on the iconv implementation. For GNU gettext
the recommended iconv implementations are:
  - on glibc systems: GNU libc,
  - otherwise: GNU libiconv.
Therefore here are the results on GNU libc (2.32) and on some other OS
(FreeBSD 13) with GNU libiconv:

With a mail.po that contains only umlauts:

Output on glibc systems (e.g. 2.32):
1 Empfänger
1 Empfaenger

Output on non-glibc systems with GNU libiconv:
1 Empfänger
1 Empf"anger

With a mail-utf8.po that contains also Hanzi characters:

Output on glibc systems (e.g. 2.32):
1 Empfänger Chinese (中文,普通话,汉语)  你好
1 Empfaenger Chinese (??,???,??)  ??

Output on non-glibc systems with GNU libiconv:
1 Empfänger Chinese (中文,普通话,汉语)  你好
recipient

As you can see:

  * For the first line of output, since the output encoding is UTF-8,
iconv() never needed transliteration and never failed.

  * For the second line of output, in the first three cases, iconv()
did transliteration, and the result was always an ASCII string.
(The quality of glibc's transliteration of Hanzi characters to
question marks can be debated, though.)

  * In the last case, iconv() failed, and thus GNU gettext output
the corresponding argument to ngettext() untranslated.
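
The fallback in the last case can be sketched as a small selection function. The name ngettext_result() is hypothetical; `converted` stands for the translated and converted string, or NULL when no translation was found or the conversion failed.

```c
#include <assert.h>
#include <stddef.h>

/* Pick what ngettext() hands back: the converted translation if there
   is one, otherwise the untranslated argument chosen by the n == 1 rule. */
const char *ngettext_result(const char *converted, const char *msgid1,
                            const char *msgid2, unsigned long n)
{
  assert(msgid1 != NULL && msgid2 != NULL);
  if (converted != NULL)
    return converted;
  return n == 1 ? msgid1 : msgid2;
}
```

With n_recipients = 1 and a failed conversion this returns the "recipient" argument, which matches the last output shown above.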

> This raises a few questions: does the GNU gettext team agree that this
> can be considered a bug

No. Please provide a reproducible test case that produces wrong results
on an interesting platform. NetBSD 3.0 or IRIX 6.5, for example, don't
count.

Bruno
/* Preparations:
- Install locale named 'de_DE.UTF-8' (using localedef).
- Find attached mail.po
- $ mkdir -p de/LC_MESSAGES
  $ msgfmt -c -o de/LC_MESSAGES/mail.mo mail.po
  or
  $ msgfmt -c -o de/LC_MESSAGES/mail.mo mail-utf8.po
- $ gcc -Wall foo.c
- $ LC_ALL=de_DE.UTF-8 ./a.out
*/

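#include <libintl.h>  /* textdomain, bindtextdomain, bind_textdomain_codeset, ngettext */
#include <locale.h>   /* setlocale, LC_ALL */
#include <stdio.h>    /* printf */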

int
main ()
{
  if (setlocale (LC_ALL, "") == NULL)
    return 1;
  textdomain ("mail");
  bindtextdomain ("mail", ".");

  unsigned int n_recipients;

  n_recipients = 1;
  // The following outputs "1 Empfänger" encoded in UTF-8:
  printf("%s\n", ngettext("recipient", "recipients", n_recipients));

  bind_textdomain_codeset("mail", "ASCII");

  n_recipients = 1;
  // The following outputs "recipient" with the same encoding as the "recipient"
  // argument to ngettext (remember, the system is assumed to not support
  // conversion from ISO/IEC 8859-1 to ASCII):
  printf("%s\n", ngettext("recipient", "recipients", n_recipients));
  // On GNU gettext, "1 Empfänger" is output in ISO-8859-1 here (i.e. no
  // conversion is done). I think we already agreed on considering this
  // behavior a bug,
}
/*
With a mail.po that contains only umlauts:

Output on glibc systems (e.g. 2.32):
1 Empfänger
1 Empfaenger

Output on non-glibc systems with GNU libiconv:
1 Empfänger
1 Empf"anger

With a mail-utf8.po that contains also Hanzi characters:

Output on glibc systems (e.g. 2.32):
1 Empfänger Chinese (中文,普通话,汉语)  你好
1 Empfaenger Chinese (??,???,??)  ??

Output on non-glibc systems with GNU libiconv:
1 Empfänger Chinese (中文,普通话,汉语)  你好
recipient

*/
msgid ""
msgstr ""
"Content-Type: text/plain; charset=ISO_8859-1\n"
"Plural-Forms: nplurals=4; plural= n==1?0: (n>1 && n< 5)?1: (n==0)? 2:3;\n"

msgid "recipient"
msgid_plural "recipients"
msgstr[0] "1 Empfänger"
msgstr[1] "2 bis 4 Empfänger"
msgstr[2] "keine Empfänger"
msgstr[3] "mehr als 4 Empfänger"
msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"
"Plural-Forms: nplurals=4; plural= n==1?0: (n>1 && n< 5)?1: (n==0)? 2:3;\n"

msgid "recipient"
msgid_plural "recipients"
msgstr[0] "1 Empfänger Chinese (中文,普通话,汉语)  你好"
msgstr[1] "2 bis 4 Empfänger"
msgstr[2] 

Question regarding gettext behavior on iconv failure

2021-05-03 Thread Eric Blake via austin-group-l at The Open Group
Hello GNU gettext maintainers,

In today's Austin Group meeting, we developed an example of using the
proposed POSIX standardization of gettext() and encountered a situation
where we felt that GNU gettext may have a bug.  For context, the entire
example is at:
https://posix.rhansen.org/p/gettext_split

The example in question set up several .po files and a specific
environment to test various pluralization/transcoding fallbacks, and
concludes with a snippet where a string with an encoding error in
ISO-8859-1 is output in spite of an iconv failure, rather than the
string passed in to ngettext():


n_recipients = 1;
// The following outputs "1 Empfänger" encoded in UTF-8:
printf("%s\n", ngettext("recipient", "recipients", n_recipients));

bind_textdomain_codeset("mail", "ASCII");

n_recipients = 1;
// The following outputs "recipient" with the same encoding as the
// "recipient" argument to ngettext (remember, the system is assumed
// to not support conversion from ISO/IEC 8859-1 to ASCII):
printf("%s\n", ngettext("recipient", "recipients", n_recipients));
// On GNU gettext, "1 Empfänger" is output in ISO-8859-1 here (i.e.
// no conversion is done). I think we already agreed on considering this
// behavior a bug,

This raises a few questions: does the GNU gettext team agree that this
can be considered a bug, and if so, will a future gettext release behave
differently?  Or if it is intentional and not a bug, can you provide
justification for the behavior as well as tweaks to the proposed
standard wording for gettext requirements and the worked example?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org