Re: Question regarding gettext behavior on iconv failure
On Mon, May 3, 2021 at 5:38 PM Bruno Haible via austin-group-l at The Open Group wrote:
> 1 Empfaenger Chinese (??,???,??) ??
> * For the second line of output, in the first three cases, iconv()
> did transliteration, and the result was always an ASCII string.
> (The quality of glibc's transliteration of Hanzi characters to
> question marks can be debated, though.)

Completely off-topic, but is there a "high quality" transliteration of Hanzi characters? Would you have expected a phoneme to be spelled out in ASCII? I am not aware of any way to keep the meaning of the Hanzi characters in ASCII, therefore you see the locale's "default_missing" character U+003F '?'.

Cheers, Carlos.
POSIX gettext() and the locale category
https://posix.rhansen.org/p/gettext_split says (line 217): "All of the functions in the gettext family of functions, except dcgettext(), search for messages objects only in the LC_MESSAGES category." dcgettext_l, dcngettext, dcngettext_l also search in the specified category. Suggested wording: "All of the functions in the gettext family of functions, except for dcgettext(), dcgettext_l(), dcngettext(), and dcngettext_l() search for messages objects only in the LC_MESSAGES category." Bruno
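To illustrate the category argument discussed above, here is a minimal sketch in C. The wrapper names lookup_in_category() and lookup_plural_in_category() are hypothetical; only dcgettext() and dcngettext() are existing libintl calls. The proposed *_l variants are not used because they are not assumed to be available.

```c
#include <libintl.h>
#include <locale.h>

/* Hypothetical wrapper: look up msgid in `domain`, searching the catalog
   directory that belongs to `category` (for example LC_TIME) rather than
   the default LC_MESSAGES. */
static const char *lookup_in_category(const char *domain, const char *msgid,
                                      int category)
{
    return dcgettext(domain, msgid, category);
}

/* Hypothetical wrapper: plural-aware variant of the above. */
static const char *lookup_plural_in_category(const char *domain,
                                             const char *singular,
                                             const char *plural,
                                             unsigned long n, int category)
{
    return dcngettext(domain, singular, plural, n, category);
}
```

When no messages object is found for the domain, both calls return their untranslated argument, so lookup_in_category("mail", "recipient", LC_TIME) simply yields "recipient" on a system without a matching catalog.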
POSIX gettext() and chdir()
https://posix.rhansen.org/p/gettext_split says (line 273): "The bindtextdomain() function shall not perform pathname resolution on dirname (that is done by the gettext family of functions)."

This is indeed how GNU gettext and GNU libc behave. However, this is not optimal:
1) If the dirname is not absolute, the application cannot use chdir() at various points and still expect gettext() to work.
2) Storing a dirname over a long time and using it only later during the gettext() call opens the door to file system races.

A more modern approach would be to
* have bindtextdomain() open() the given directory and store the resulting file descriptor,
* have gettext() use openat() instead of open() for locating and opening the message catalogs.

Such changes have not been implemented in GNU gettext and glibc so far. But it would be good to not preclude such an improved implementation.

Suggested wording: "It is unspecified whether the bindtextdomain() function performs pathname resolution on dirname, or whether that is done by the gettext family of functions."

Bruno
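The open()/openat() approach described above could look roughly like the following sketch. It is not code from GNU gettext or glibc (the message states that neither implements this); the helper names bind_directory(), catalog_relpath() and open_catalog() are hypothetical, and the relative path layout is assumed to be localename/categoryname/textdomainname.mo.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical: what bindtextdomain() could do up front. The directory is
   resolved once, at bind time, and kept as a file descriptor. */
static int bind_directory(const char *dirname)
{
    return open(dirname, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/* Build "localename/categoryname/textdomainname.mo" into buf.
   Returns 0 on success, -1 if the result does not fit. */
static int catalog_relpath(char *buf, size_t size, const char *localename,
                           const char *categoryname, const char *domain)
{
    int n = snprintf(buf, size, "%s/%s/%s.mo", localename, categoryname, domain);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}

/* Hypothetical: what gettext() could do later. Because the lookup is
   relative to dirfd, an intervening chdir() does not affect it. */
static int open_catalog(int dirfd, const char *localename,
                        const char *categoryname, const char *domain)
{
    char rel[256];
    if (catalog_relpath(rel, sizeof rel, localename, categoryname, domain) != 0)
        return -1;
    return openat(dirfd, rel, O_RDONLY | O_CLOEXEC);
}
```

A caller would keep the descriptor from bind_directory(".") for the lifetime of the binding and pass it to open_catalog(fd, "de", "LC_MESSAGES", "mail") on each lookup.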
POSIX gettext() and uselocale()
https://posix.rhansen.org/p/gettext_split says (line 92): "The returned string may be invalidated by a subsequent call to bind_textdomain_codeset(), bindtextdomain(), setlocale(), textdomain(), or uselocale()."

While in most programs setlocale(), textdomain(), bindtextdomain(), and bind_textdomain_codeset() are called at the beginning of program execution, before any call to gettext(), the situation is very different for uselocale().
1) uselocale() is meant to have effects ONLY on the thread in which it is called.
2) uselocale() is a helper function to implement *_l functions where the POSIX standard does not specify them or the system does not have them. For example, when a program wants to have a function to parse a number, recognizing only the ASCII digits and only '.' as decimal separator, a reliable way to implement such a function is by calling uselocale() with the "C" locale, then strtod(), and then uselocale() again to switch the thread back to the previous locale. If POSIX did not have uselocale(), it would need to provide many more *_l functions.

If the gettext() result may be invalidated by a uselocale() call (in any other thread!), this would mean that
** Programs can use gettext() or uselocale() but not both. **
and - more or less -
** Multithreaded programs that use libraries (that may use uselocale()) cannot use gettext(). **

I think that specifying gettext() to be so restricted is not useful. It would make more sense to allow concurrent uselocale() calls.

Proposed wording: "The returned string may be invalidated by a subsequent call to bind_textdomain_codeset(), bindtextdomain(), setlocale(), or textdomain()."

Bruno
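The number-parsing use of uselocale() mentioned in point 2 can be sketched as follows. The function name strtod_c() is hypothetical; newlocale(), uselocale(), freelocale() and strtod() are the POSIX calls. This is only an illustration of the pattern, with minimal error handling.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: parse a number with "C" locale rules (ASCII digits,
   '.' as decimal separator), whatever the thread's current locale is. */
static double strtod_c(const char *s, char **endptr)
{
    locale_t c_locale = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (c_locale == (locale_t)0) {
        /* Could not create the locale object: report "nothing parsed". */
        if (endptr != NULL)
            *endptr = (char *)s;
        return 0.0;
    }

    /* Switch only the calling thread to the "C" locale. */
    locale_t previous = uselocale(c_locale);
    double value = strtod(s, endptr);

    /* Switch the thread back, then release the temporary locale object. */
    uselocale(previous);
    freelocale(c_locale);
    return value;
}
```

A library doing this in one thread is exactly the kind of uselocale() call that, under the quoted wording, would be allowed to invalidate gettext() results obtained elsewhere in the program.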
POSIX gettext() and the installation directories for .mo files
https://posix.rhansen.org/p/gettext_split says (line 77..79) "For each locale name in LANGUAGE, or if LANGUAGE is not set or is empty, or no suitable messages object is found in processing LANGUAGE, the pathname used to locate the messages object shall be dirname/localename/categoryname/textdomainname.mo, where: ... For the LANGUAGE search, the localename part is each locale name from LANGUAGE in turn. For the single-locale search, the localename part is the name of the current locale, or the locale specified in an *_l() function call, for the category named by categoryname."

This is NOT how GNU gettext behaves. If POSIX standardizes it like this, GNU libc and GNU gettext will have the choice among
(a) looking in different (and fewer) directories than they do today, causing major i18n dysfunctionality to users, until the users have set up lots of symbolic links between directories, or
(b) violating POSIX in this point.
I will vote for (b).

Namely, what GNU gettext does is to look in SEVERAL (not ONE) directories per LANGUAGE element. The localename parts of these directories are constructed from the language identifier (element of LANGUAGE) or locale name. For example:
* The language identifier 'de' gives rise to the localename part
    de
* The language identifier 'de_AT' gives rise to the localename parts
    de_AT
    de
* The locale name 'de_AT.UTF-8' gives rise to the localename parts
    de_AT.UTF-8
    de_AT.utf8
    de_AT
    de.UTF-8
    de.utf8
    de
* The locale name 'uz_UZ.UTF-8@cyrillic' gives rise to the localename parts
    uz_UZ.UTF-8@cyrillic
    uz_UZ.utf8@cyrillic
    uz_UZ@cyrillic
    uz.UTF-8@cyrillic
    uz.utf8@cyrillic
    uz@cyrillic
    uz_UZ.UTF-8
    uz_UZ.utf8
    uz_UZ
    uz.UTF-8
    uz.utf8
    uz

This list of directories is important for people who live in communities which often (but not always) have translations of their own but can read translations for other locales. In the examples above:
* A user in Austria prefers translations for Austrian German, but can also read German with no problem.
* A user in Uzbekistan may prefer translations in Cyrillic but can also read translations in Latin. [1]

If the above text were adopted, it would have the consequences that
1) Many symbolic links are needed in /usr/share/locale/. Solaris 11.4 is a system that implements gettext() as described in the above text, and it has the links shown below [2].
2) Users who want to create a new locale (e.g. for English in Australia) will have to create a symlink
    /usr/share/locale/en_AU -> /usr/share/locale/en
   and so on for each custom locale.
3) Users who install packages in non-privileged directories (for GNU programs, that's the --prefix=PREFIX option) will have to create the same amount of symbolic links in their PREFIX/share/locale/ directory.
4) Users will have to set fallback logic in their LANGUAGE environment variable
    LANGUAGE=de_AT:de_DE
   instead of having it built-in:
    LANGUAGE=de_AT

This is BAD, BAD, BAD.

Bruno

[1] https://en.wikipedia.org/wiki/Uzbek_alphabet
[2] $ ls -l /usr/share/locale
total 102
drwxr-xr-x 3 root other 3 Oct 13 2018 C
drwxr-xr-x 3 root other 4 Oct 13 2018 de
lrwxrwxrwx 1 root root 2 Oct 13 2018 de_DE -> de
lrwxrwxrwx 1 root root 2 Oct 13 2018 de_DE.ISO8859-1 -> de
lrwxrwxrwx 1 root root 2 Oct 13 2018 de_DE.ISO8859-15 -> de
lrwxrwxrwx 1 root root 2 Oct 13 2018 de_DE.UTF-8 -> de
lrwxrwxrwx 1 root root 2 Oct 13 2018 de.ISO8859-15 -> de
drwxr-xr-x 3 root other 3 Oct 13 2018 de.us-ascii
lrwxrwxrwx 1 root root 2 Oct 13 2018 de.UTF-8 -> de
drwxr-xr-x 3 root other 3 Oct 13 2018 en
drwxr-xr-x 3 root other 3 Oct 13 2018 en_US
drwxr-xr-x 3 root other 3 Oct 13 2018 en@boldquot
drwxr-xr-x 3 root other 3 Oct 13 2018 en@quot
drwxr-xr-x 3 root other 3 Oct 13 2018 en@shaw
drwxr-xr-x 3 root other 4 Oct 13 2018 es
drwxr-xr-x 3 root other 3 Oct 13 2018 es_ES
lrwxrwxrwx 1 root root 2 Oct 13 2018 es_ES.ISO8859-1 -> es
lrwxrwxrwx 1 root root 2 Oct 13 2018 es_ES.ISO8859-15 -> es
lrwxrwxrwx 1 root root 2 Oct 13 2018 es_ES.UTF-8 -> es
lrwxrwxrwx 1 root root 2 Oct 13 2018 es.ISO8859-15 -> es
lrwxrwxrwx 1 root root 2 Oct 13 2018 es.UTF-8 -> es
drwxr-xr-x 3 root other 4 Oct 13 2018 fr
lrwxrwxrwx 1 root root 2 Oct 13 2018 fr_FR -> fr
lrwxrwxrwx 1 root root 2 Oct 13 2018 fr_FR.ISO8859-1 -> fr
lrwxrwxrwx 1 root root 2 Oct 13 2018 fr_FR.ISO8859-15 -> fr
lrwxrwxrwx 1 root root 2 Oct 13 2018 fr_FR.UTF-8 -> fr
lrwxrwxrwx 1 root root
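The expansion of one locale name into several localename parts, as listed in the message above, can be reproduced with a short sketch. This is not the GNU gettext implementation; the function names are hypothetical, and the codeset normalization is a simplified assumption (lowercase, non-alphanumerics dropped, so "UTF-8" becomes "utf8"). The ordering follows the examples in the message: modifier first, then territory, then codeset.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define PART_LEN 32        /* max length of one component, incl. NUL */
#define NAME_LEN 128       /* max length of one candidate, incl. NUL */
#define MAX_CANDIDATES 12  /* 2 (modifier) * 2 (territory) * 3 (codeset) */

/* Simplified normalization (an assumption): "UTF-8" -> "utf8". */
static void normalize_codeset(const char *in, char *out, size_t outsize)
{
    size_t n = 0;
    for (; *in != '\0' && n + 1 < outsize; in++)
        if (isalnum((unsigned char)*in))
            out[n++] = (char)tolower((unsigned char)*in);
    out[n] = '\0';
}

/* Cut buf at the first `sep`; copy what follows into part (or ""). */
static void cut_at(char *buf, char sep, char *part)
{
    char *p = strchr(buf, sep);
    part[0] = '\0';
    if (p != NULL) {
        snprintf(part, PART_LEN, "%s", p + 1);
        *p = '\0';
    }
}

/* Expand "ll_CC.codeset@modifier" into candidate localename parts,
   most specific first. out must have room for MAX_CANDIDATES entries.
   Returns the number of candidates written. */
static int expand_locale_name(const char *name, char out[][NAME_LEN])
{
    char buf[NAME_LEN], lang[PART_LEN], terr[PART_LEN];
    char cs[PART_LEN], norm[PART_LEN], mod[PART_LEN];
    int count = 0;

    snprintf(buf, sizeof buf, "%s", name);
    cut_at(buf, '@', mod);   /* modifier, e.g. "cyrillic" */
    cut_at(buf, '.', cs);    /* codeset, e.g. "UTF-8" */
    cut_at(buf, '_', terr);  /* territory, e.g. "AT" */
    snprintf(lang, sizeof lang, "%s", buf);
    normalize_codeset(cs, norm, sizeof norm);

    for (int m = (mod[0] != '\0'); m >= 0; m--)
        for (int t = (terr[0] != '\0'); t >= 0; t--)
            for (int c = 0; c < 3; c++) {
                /* c == 0: codeset as given, 1: normalized, 2: none */
                const char *codeset = (c == 0) ? cs : (c == 1) ? norm : "";
                if (c < 2 && codeset[0] == '\0')
                    continue;
                if (c == 1 && strcmp(norm, cs) == 0)
                    continue;  /* would duplicate the c == 0 entry */
                snprintf(out[count++], NAME_LEN, "%s%s%s%s%s%s%s",
                         lang, t ? "_" : "", t ? terr : "",
                         codeset[0] ? "." : "", codeset,
                         m ? "@" : "", m ? mod : "");
            }
    return count;
}
```

Each candidate would then be substituted into dirname/localename/categoryname/textdomainname.mo in turn, which is what makes the symbolic links of the Solaris listing unnecessary.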
POSIX gettext() and the LANGUAGE environment variable
https://posix.rhansen.org/p/gettext_split says (line 72) "For the LANGUAGE search, the value of the LANGUAGE environment variable shall be a list of one or more locale names separated by a colon (':') character."

This is NOT how GNU gettext behaves. If POSIX standardizes it like this, GNU libc and GNU gettext will have the choice among
(a) forcing users to specify their preferences in a user-unfriendly way, or
(b) violating POSIX in this point.
I will vote for (b).

Namely, what gettext expects in the LANGUAGE environment variable is documented in https://www.gnu.org/software/gettext/manual/html_node/The-LANGUAGE-variable.html

In a modern glibc system, the locale names are essentially
    C
    POSIX
    en_US.UTF-8
    de_DE.UTF-8
    fr_FR.UTF-8
    pt_BR.UTF-8
    etc.

We do NOT want a user who wants to see messages in Arabic (1st preference) or French (2nd preference) to have to set
    LANGUAGE=ar_EG.UTF-8:fr_FR.UTF-8
We want the user merely to have to write
    LANGUAGE=ar:fr

Suggested wording change: "For the LANGUAGE search, the value of the LANGUAGE environment variable shall be a list of one or more language identifiers. A language identifier is a locale name with the '.codeset' part removed and optionally also the territory and/or the modifiers removed. In the simplest case, a language identifier consists of just an ISO 639-1 code."

Bruno
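As a small illustration of the colon-separated list format, the following sketch splits a LANGUAGE value into its elements in priority order. The function name split_language() and the size limits are hypothetical, and skipping empty or over-long elements is an assumption of this sketch rather than documented gettext behavior.

```c
#include <string.h>

#define MAX_LANGS 8   /* illustrative limit on list elements */
#define LANG_LEN 32   /* illustrative limit on one element, incl. NUL */

/* Split a LANGUAGE value such as "ar:fr" into its elements, in priority
   order. Empty and over-long elements are skipped.
   Returns the number of elements stored in out. */
static int split_language(const char *value, char out[][LANG_LEN], int max)
{
    int count = 0;
    while (*value != '\0' && count < max) {
        size_t len = strcspn(value, ":");   /* length up to next ':' */
        if (len > 0 && len < LANG_LEN) {
            memcpy(out[count], value, len);
            out[count][len] = '\0';
            count++;
        }
        value += len;
        if (*value == ':')
            value++;                         /* step over the separator */
    }
    return count;
}
```

A caller would typically pass the result of getenv("LANGUAGE") after checking it for NULL; with LANGUAGE=ar:fr this yields "ar" followed by "fr".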
POSIX gettext() and iconv_open()
https://posix.rhansen.org/p/gettext_split says (line 85) "The conversion shall be performed as if by a call to iconv() using a conversion descriptor returned by iconv_open(, )."

This is NOT how GNU gettext behaves. If POSIX standardizes it like this, GNU libc and GNU gettext will have the choice among
(a) dropping the transliteration during charset conversion of the messages, or
(b) violating POSIX in this point.
I will vote for (b).

Namely, what GNU gettext does is to allocate an iconv descriptor that allows transliteration. For example, when converting from ISO-8859-1 or UTF-8 to ASCII, it would transform "1 Empfänger" to
    "1 Empfaenger" (glibc in German locale) or
    "1 Empf\"anger" (GNU libiconv)

Suggested wording change: "The conversion shall be performed as if by a call to iconv() using a conversion descriptor that converts from the returned to the , with an implementation-dependent conversion quality."

Bruno
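A conversion descriptor that allows transliteration can be requested from glibc and GNU libiconv by appending "//TRANSLIT" to the target codeset name. The sketch below shows that pattern; it is an illustration under that assumption, not the code GNU gettext actually uses, and the function name to_ascii_translit() is hypothetical. The transliterated result depends on the iconv implementation and the current locale, as the message describes.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert `in` (inlen bytes, encoded in from_codeset)
   to NUL-terminated ASCII in `out`, preferring transliteration over
   failure. outsize must be at least 1.
   Returns 0 on success, -1 if the conversion is unavailable or fails. */
static int to_ascii_translit(const char *from_codeset,
                             const char *in, size_t inlen,
                             char *out, size_t outsize)
{
    /* "//TRANSLIT" is an extension of glibc and GNU libiconv. */
    iconv_t cd = iconv_open("ASCII//TRANSLIT", from_codeset);
    if (cd == (iconv_t)-1)
        return -1;

    char *inptr = (char *)in;       /* iconv() takes a non-const pointer */
    char *outptr = out;
    size_t inleft = inlen;
    size_t outleft = outsize - 1;   /* keep room for the terminating NUL */

    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outptr, &outleft);  /* flush state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;
    *outptr = '\0';
    return 0;
}
```

A caller that gets -1 back would fall back to the untranslated msgid, which matches the behavior reported in this thread when iconv() fails.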
Re: Question regarding gettext behavior on iconv failure
Hi Eric,

> The example in question set up several .po files and a specific
> environment to test various pluralization/transcoding fallbacks, and
> concludes with a snippet where a string with an encoding error in
> ISO-8859-1 is output in spite of an iconv failure, rather than the
> string passed in to ngettext():
>
> n_recipients = 1;
> // The following outputs "1 Empfänger" encoded in UTF-8:
> printf("%s\n", ngettext("recipient", "recipients", n_recipients));
>
> bind_textdomain_codeset("mail", "ASCII");
>
> n_recipients = 1;
> // The following outputs "recipient" with the same encoding as the "recipient"
> // argument to ngettext (remember, the system is assumed to not support
> // conversion from ISO/IEC 8859-1 to ASCII):
> printf("%s\n", ngettext("recipient", "recipients", n_recipients));
> // On GNU gettext, "1 Empfänger" is output in ISO-8859-1 here (i.e.
> // no conversion is done). I think we already agreed on considering this
> // behavior a bug,

I cannot reproduce this. Find attached my (complete) test case.

GNU gettext uses iconv_open() with arguments that indicate that a non-1:1 conversion (e.g. transliteration) is better than a failure. The result thus depends on the iconv implementation. For GNU gettext the recommended iconv implementations are:
- on glibc systems: GNU libc,
- otherwise: GNU libiconv.

Therefore here are the results on GNU libc (2.32) and on some other OS (FreeBSD 13) with GNU libiconv:

With a mail.po that contains only umlauts:

Output on glibc systems (e.g. 2.32):
1 Empfänger
1 Empfaenger

Output on non-glibc systems with GNU libiconv:
1 Empfänger
1 Empf"anger

With a mail-utf8.po that also contains Hanzi characters:

Output on glibc systems (e.g. 2.32):
1 Empfänger Chinese (中文,普通话,汉语) 你好
1 Empfaenger Chinese (??,???,??) ??

Output on non-glibc systems with GNU libiconv:
1 Empfänger Chinese (中文,普通话,汉语) 你好
recipient

As you can see:
* For the first line of output, since the output encoding is UTF-8, iconv() never needed transliteration and never failed.
* For the second line of output, in the first three cases, iconv() did transliteration, and the result was always an ASCII string. (The quality of glibc's transliteration of Hanzi characters to question marks can be debated, though.)
* In the last case, iconv() failed, and thus GNU gettext output the corresponding argument to ngettext() untranslated.

> This raises a few questions: does the GNU gettext team agree that this
> can be considered a bug

No. Please provide a reproducible test case that produces wrong results on an interesting platform. NetBSD 3.0 or IRIX 6.5, for example, don't count.

Bruno

/* Preparations:
   - Install locale named 'de_DE.UTF-8' (using localedef).
   - Find attached mail.po
   - $ mkdir -p de/LC_MESSAGES
     $ msgfmt -c -o de/LC_MESSAGES/mail.mo mail.po
     or
     $ msgfmt -c -o de/LC_MESSAGES/mail.mo mail-utf8.po
   - $ gcc -Wall foo.c
   - $ LC_ALL=de_DE.UTF-8 ./a.out
 */
#include <libintl.h>
#include <locale.h>
#include <stdio.h>

int main ()
{
  if (setlocale (LC_ALL, "") == NULL)
    return 1;
  textdomain ("mail");
  bindtextdomain ("mail", ".");

  unsigned int n_recipients;

  n_recipients = 1;
  // The following outputs "1 Empfänger" encoded in UTF-8:
  printf("%s\n", ngettext("recipient", "recipients", n_recipients));

  bind_textdomain_codeset("mail", "ASCII");

  n_recipients = 1;
  // The following outputs "recipient" with the same encoding as the "recipient"
  // argument to ngettext (remember, the system is assumed to not support
  // conversion from ISO/IEC 8859-1 to ASCII):
  printf("%s\n", ngettext("recipient", "recipients", n_recipients));
  // On GNU gettext, "1 Empfänger" is output in ISO-8859-1 here (i.e. no
  // conversion is done). I think we already agreed on considering this
  // behavior a bug,
}

/*
With a mail.po that contains only umlauts:

Output on glibc systems (e.g. 2.32):
1 Empfänger
1 Empfaenger

Output on non-glibc systems with GNU libiconv:
1 Empfänger
1 Empf"anger

With a mail-utf8.po that also contains Hanzi characters:

Output on glibc systems (e.g. 2.32):
1 Empfänger Chinese (中文,普通话,汉语) 你好
1 Empfaenger Chinese (??,???,??) ??

Output on non-glibc systems with GNU libiconv:
1 Empfänger Chinese (中文,普通话,汉语) 你好
recipient
*/

msgid ""
msgstr ""
"Content-Type: text/plain; charset=ISO_8859-1\n"
"Plural-Forms: nplurals=4; plural= n==1?0: (n>1 && n< 5)?1: (n==0)? 2:3;\n"

msgid "recipient"
msgid_plural "recipients"
msgstr[0] "1 Empfänger"
msgstr[1] "2 bis 4 Empfänger"
msgstr[2] "keine Empfänger"
msgstr[3] "mehr als 4 Empfänger"

msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"
"Plural-Forms: nplurals=4; plural= n==1?0: (n>1 && n< 5)?1: (n==0)? 2:3;\n"

msgid "recipient"
msgid_plural "recipients"
msgstr[0] "1 Empfänger Chinese (中文,普通话,汉语) 你好"
msgstr[1] "2 bis 4 Empfänger"
msgstr[2]
Question regarding gettext behavior on iconv failure
Hello GNU gettext maintainers,

In today's Austin Group meeting, we developed an example of using the proposed POSIX standardization of gettext() and encountered a situation where we felt that GNU gettext may have a bug. For context, the entire example is at: https://posix.rhansen.org/p/gettext_split

The example in question set up several .po files and a specific environment to test various pluralization/transcoding fallbacks, and concludes with a snippet where a string with an encoding error in ISO-8859-1 is output in spite of an iconv failure, rather than the string passed in to ngettext():

n_recipients = 1;
// The following outputs "1 Empfänger" encoded in UTF-8:
printf("%s\n", ngettext("recipient", "recipients", n_recipients));

bind_textdomain_codeset("mail", "ASCII");

n_recipients = 1;
// The following outputs "recipient" with the same encoding as the "recipient"
// argument to ngettext (remember, the system is assumed to not support
// conversion from ISO/IEC 8859-1 to ASCII):
printf("%s\n", ngettext("recipient", "recipients", n_recipients));
// On GNU gettext, "1 Empfänger" is output in ISO-8859-1 here (i.e. no
// conversion is done). I think we already agreed on considering this
// behavior a bug,

This raises a few questions: does the GNU gettext team agree that this can be considered a bug, and if so, will a future gettext release behave differently? Or if it is intentional and not a bug, can you provide justification for the behavior as well as tweaks to the proposed standard wording for gettext requirements and the worked example?

--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org