[HACKERS] Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII

Noah Misch Mon, 09 Sep 2013 12:10:49 -0700

On Mon, Sep 09, 2013 at 08:29:58AM -0400, Peter Eisentraut wrote:
> On 9/6/13 10:37 AM, Tom Lane wrote:
> > BTW: personally, I would say that what you're looking at is a glibc bug.
> > I always thought the contract of gettext was to return the ASCII version
> > if it fails to produce a translated version.  That might not be what the
> > end user really wants to see, but surely returning something like "???"
> > is completely useless to anybody.
> 
> The question marks come from iconv.  Take a look at what this prints:
> 
> iconv po/ja.po -f utf-8 -t us-ascii//translit
> 
> If you use GNU libiconv, this will print a bunch of question marks.


Actually, GNU libiconv's iconv() decides that //translit is unimplementable
for some of the characters in that file, and it fails the conversion.  GNU
libc's iconv(), on the other hand, emits the question marks.

> I think the use of //translit by gettext is poor judgement, because my
> experiments show that the quality of the results is poor and not useful
> for a user interface.

It depends on the quality of the //translit implementation.  GNU libiconv's
seems pretty good.  It gives up for Japanese or Russian characters, so you get
the English messages.  For Polish, GNU libiconv transliterates like this:

msgstr "nie można usunąć pliku lub katalogu \"%s\": %s\n"
msgstr "nie mozna usuna'c pliku lub katalogu \"%s\": %s\n"

That's fair, considering what it has to work with.  Ideally, (a) GNU libc
should import the smarter transliteration code from GNU libiconv, and (b) GNU
gettext should check for weak //translit implementations and not use
//translit under such circumstances.

> My suggestion in this matter is to disable gettext processing when
> LC_CTYPE is set to C.  We could log a warning when LC_MESSAGES is set to
> something and LC_CTYPE is set to C.  Or just do the warning and keep
> logging.  Something like that.

In an ENCODING=UTF8, LC_CTYPE=C database, no transliteration should need to
happen, and no transliteration does happen for the PG messages.  I think
MauMau's original bind_textdomain_codeset() proposal was on the right track.
We would need to do that for every relevant 3rd-party message domain, though.
Ick.  This suggests to me that gettext really needs an API for overriding the
default codeset pertaining to message domains not subjected to
bind_textdomain_codeset().  In the meantime, adding bind_textdomain_codeset()
calls for known localized dependencies seems like a fine coping mechanism.

If we can reasonably detect when gettext is supplying useless ????? messages,
that's good, too.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII

Reply via email to