[HACKERS] Windows and locales and UTF-8 (oh my)

Tom Lane Sat, 06 Oct 2007 10:56:19 -0700

I've been learning much more than I wanted to know about $SUBJECT
since putting in the src/port/chklocale.c code to try to enforce
that our database encoding matches the system locale settings.
There's an ongoing thread in -patches that's been focused on
getting reasonable behavior from the point of view of the Far
Eastern contingent:
http://archives.postgresql.org/pgsql-patches/2007-10/msg00031.php
(Some of that's been applied, but not the very latest proposals.)
Here's some more info from an off-list discussion with Dave Page:

------- Forwarded Messages

Date:    Fri, 05 Oct 2007 20:54:04 +0100
From:    Dave Page <[EMAIL PROTECTED]>
To:      Tom Lane <[EMAIL PROTECTED]>
Subject: Re: [CORE] 8.3beta1 Available ...

Dave Page wrote:
> Some further info on that - utf-8 on Windows is actually a
> pseudo-codepage (65001) which doesn't have NLS files, hence why we have
> to convert to utf-16 before sorting. Perhaps the utf-8/65001 name
> difference is the problem here. I'll knock up a quick test program when
> the kids have gone to bed.

So, my test prog (below) returns the following:

[EMAIL PROTECTED]:~$ ./setlc "English_United Kingdom.65001"
LC_COLLATE=English_United Kingdom.65001;LC_CTYPE=C;LC_MONETARY=English_United Kingdom.65001;LC_NUMERIC=English_United Kingdom.65001;LC_TIME=English_United Kingdom.65001

So everything other than LC_CTYPE is acceptable in UTF-8 on Windows -
and we already handle LC_CTYPE for UTF-8 on Windows through our UTF-8 ->
UTF-16 conversions internally.

Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?

Regards, Dave.

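#include <locale.h>
#include <stdio.h>

int
main(int argc, char *argv[])
{
        char *lc;

        /* Try to switch every category to the requested locale, if given. */
        if (argc > 1)
                setlocale(LC_ALL, argv[1]);

        /* Report what the C runtime actually ended up with. */
        lc = setlocale(LC_ALL, NULL);
        printf("%s\n", lc);
        return 0;
}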

------- Message 2

Date:    Fri, 05 Oct 2007 23:32:36 +0100
From:    Dave Page <[EMAIL PROTECTED]>
To:      Tom Lane <[EMAIL PROTECTED]>
Subject: Re: [CORE] 8.3beta1 Available ...

Tom Lane wrote:
> Dave Page <[EMAIL PROTECTED]> writes:
>> So, my test prog (below) returns the following:
> 
>> [EMAIL PROTECTED]:~$ ./setlc "English_United Kingdom.65001"
>> LC_COLLATE=English_United Kingdom.65001;LC_CTYPE=C;LC_MONETARY=English_United Kingdom.65001;LC_NUMERIC=English_United Kingdom.65001;LC_TIME=English_United Kingdom.65001
> 
> That's just frickin' weird ... and a bit scary.  There is a fair amount
> of code in PG that checks for lc_ctype_is_c and does things differently;
> one wonders if that isn't going to get misled by this behavior.  (Hmm,
> maybe this explains some of the "upper/lower doesn't work" reports we've
> been getting??)  Are you sure all variants of Windows act that way?

All the ones we support afaict.

>> Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?
> 
> Is there something in Windows that constrains them to be all the same?
> If not this proposal seems just plain wrong :-(  But in any case I'd
> feel more comfortable having it look at LC_COLLATE.

They can all be set independently - it's just that there's no UTF-7
(65000) or UTF-8 (65001) NLS files (http://shlimazl.nm.ru/eng/nls.htm)
defining them fully so Windows doesn't know any more than the characters
that are in both 'pseudo codepages'.

As a result, you can't set LC_CTYPE to .65001 because Windows knows it
can't handle ToUpper() or ToLower() etc. but you can use it to encode
messages and other text.
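
A per-category probe makes this kind of behavior easier to see than the
LC_ALL test does. The following is only a sketch: category_accepts() is a
hypothetical helper name, not anything in the PostgreSQL tree, and it
relies on nothing beyond the standard setlocale() contract (a NULL return
means the request was refused).

```c
#include <locale.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: report whether setlocale() accepts "name" for one
 * category, then put the previous setting back.
 */
static bool
category_accepts(int category, const char *name)
{
    const char *cur = setlocale(category, NULL);
    char       *saved;
    bool        ok;

    if (cur == NULL)
        return false;

    /* The returned string may be overwritten by later calls, so copy it. */
    saved = malloc(strlen(cur) + 1);
    if (saved == NULL)
        return false;
    strcpy(saved, cur);

    /* setlocale() returns NULL when it cannot honor the request. */
    ok = (setlocale(category, name) != NULL);

    /* Restore whatever was in effect before the probe. */
    setlocale(category, saved);
    free(saved);
    return ok;
}
```

Calling it once with LC_CTYPE and once with LC_COLLATE for a name such as
"English_United Kingdom.65001" should show directly which categories
accept the codepage on a given Windows version.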

/D

------- End of Forwarded Messages

I am thinking that Dave's discovery explains some previously unsolved
bug reports, such as
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
If Windows returns LC_CTYPE=C in a situation like this, then
the various single-byte-charset optimization paths that are enabled by
lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
upper()/lower() and other places.  ISTM we had better hack
lc_ctype_is_c() so that on Windows (only), if the database encoding
is UTF-8 then it returns FALSE regardless of what setlocale says.
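
A minimal sketch of that rule, not the actual lc_ctype_is_c() code. The
db_encoding enum, the function name, and the on_windows parameter are all
assumptions made for illustration; on_windows stands in for what would
really be an #ifdef WIN32 test, so that the logic can be exercised on any
platform.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the server's encoding identifiers. */
typedef enum
{
    ENC_SQL_ASCII,
    ENC_LATIN1,
    ENC_UTF8
} db_encoding;

/*
 * ctype_name is what setlocale(LC_CTYPE, NULL) reported.  Returns true
 * only when it is safe to take the "C locale" fast paths.
 */
static bool
ctype_is_c_sketch(const char *ctype_name, db_encoding enc, bool on_windows)
{
    /*
     * On Windows a UTF-8 database can see LC_CTYPE=C merely because
     * codepage 65001 was refused, so the report is not trustworthy.
     */
    if (on_windows && enc == ENC_UTF8)
        return false;

    return strcmp(ctype_name, "C") == 0 ||
           strcmp(ctype_name, "POSIX") == 0;
}
```

With this shape, a UTF-8 database on Windows never takes the
single-byte shortcuts, while every other platform and encoding keeps the
existing behavior.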

That still leaves me with a boatload of questions, though.  If we can't
trust LC_CTYPE as an indicator of the system charset, what can we trust?
In particular this seems to say that looking at LC_CTYPE for chklocale's
purposes is completely useless; what do we look at instead?

Another issue: is it possible to set, say, LC_MESSAGES and LC_TIME to
different codepages and if so what happens?  If that does enable
different bits of infrastructure to return incompatibly encoded strings,
seems we need a defense against that --- what should it be?

One bright spot is that this does seem to suggest a way to implement the
recommendation I made in the -patches thread: if we can't support the
encoding (codepage) used by the locale seen by initdb, we could try
stripping the codepage indicator (if any) and plastering on .65001
to get a UTF8-compatible locale name.  That'd only work on Windows
but that seems the platform where we're most likely to see unsupportable
default encodings.
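
The name rewriting could look roughly like the sketch below.
utf8_locale_name() is a hypothetical helper, and it assumes that a
codepage indicator is an all-digit suffix after the last '.' in the
locale name; other forms that Windows may accept are left untouched and
simply get ".65001" appended.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy "locale" into buf with any trailing numeric
 * ".codepage" replaced by ".65001".  Returns 0 on success, or -1 if buf
 * is too small to hold the result.
 */
static int
utf8_locale_name(const char *locale, char *buf, size_t buflen)
{
    const char *dot = strrchr(locale, '.');
    size_t      baselen = strlen(locale);
    int         n;

    /* Treat the text after the last '.' as a codepage only if all digits. */
    if (dot != NULL && dot[1] != '\0')
    {
        const char *p = dot + 1;

        while (isdigit((unsigned char) *p))
            p++;
        if (*p == '\0')
            baselen = (size_t) (dot - locale);
    }

    /* Keep the language/country part and plaster on the UTF-8 codepage. */
    n = snprintf(buf, buflen, "%.*s.65001", (int) baselen, locale);
    return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}
```

The caller would still have to pass the result to setlocale() and check
that Windows accepts it before relying on it.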

Comments?  I don't have a Windows development environment so I'm not
in a position to take the lead on testing/fixing this sort of stuff.

                        regards, tom lane
