On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
> I've been learning much more than I wanted to know about $SUBJECT
> since putting in the src/port/chklocale.c code to try to enforce
> that our database encoding matches the system locale settings.
> There's an ongoing thread in -patches that's been focused on
> getting reasonable behavior from the point of view of the Far
> Eastern contingent:
> (Some of that's been applied, but not the very latest proposals.)
> Here's some more info from an off-list discussion with Dave Page:
Sorry for the late response to this. I missed the beginning and then got
mixed up in the different threads going around :-)
> Tom Lane wrote:
> > Dave Page <[EMAIL PROTECTED]> writes:
> >> So, my test prog (below) returns the following:
> >> [EMAIL PROTECTED]:~$ ./setlc "English_United Kingdom.65001"
> >> LC_COLLATE=English_United Kingdom.65001;LC_CTYPE=C;
> >> LC_MONETARY=English_United Kingdom.65001;
> >> LC_NUMERIC=English_United Kingdom.65001;
> >> LC_TIME=English_United Kingdom.65001
> > That's just frickin' weird ... and a bit scary. There is a fair amount
> > of code in PG that checks for lc_ctype_is_c and does things differently;
> > one wonders if that isn't going to get misled by this behavior. (Hmm,
> > maybe this explains some of the "upper/lower doesn't work" reports we've
> > been getting??) Are you sure all variants of Windows act that way?
> All the ones we support afaict.
AFAICT, this has been standard behaviour in Windows since forever. Certainly
since Windows 2000, which is what we care about.
Windows 9x had different ways of dealing with it since they weren't native
UTF16 internally, but that doesn't matter to us here.
> >> Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?
> > Is there something in Windows that constrains them to be all the same?
> > If not this proposal seems just plain wrong :-( But in any case I'd
> > feel more comfortable having it look at LC_COLLATE.
> They can all be set independently - it's just that there's no UTF-7
> (65000) or UTF-8 (65001) NLS files (http://shlimazl.nm.ru/eng/nls.htm)
> defining them fully so Windows doesn't know any more than the characters
> that are in both 'pseudo codepages'.
> As a result, you can't set LC_CTYPE to .65001 because Windows knows it
> can't handle ToUpper() or ToLower() etc. but you can use it to encode
> messages and other text.
Yes. And also important, you can set LC_COLLATE to it, which will make all
the UTF16 versions of the functions behave properly.
Remember - all the Windows NT+ operations are UTF16 internally. So when you
set LC_TIME to it, for example, the API functions will generate the
resulting string in UTF16 and then convert it to whatever encoding you
chose - be it UTF8 or LATIN1 or whatever.
> I am thinking that Dave's discovery explains some previously unsolved
> bug reports, such as
> If Windows returns LC_CTYPE=C in a situation like this, then
> the various single-byte-charset optimization paths that are enabled by
> lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
> upper()/lower() and other places. ISTM we had better hack
> lc_ctype_is_c() so that on Windows (only), if the database encoding
> is UTF-8 then it returns FALSE regardless of what setlocale says.
Yes, I think we need a change to that routine.
But what about the case when we actually *have* locale=C and
encoding=UTF8? We need to care for that one somehow. Perhaps we should look
at LC_COLLATE instead (again, on Windows only. Possibly even only in the
Windows+locale_returns_c+encoding=utf8 case, to distinguish these two)?
> That still leaves me with a boatload of questions, though. If we can't
> trust LC_CTYPE as an indicator of the system charset, what can we trust?
> In particular this seems to say that looking at LC_CTYPE for chklocale's
> purposes is completely useless; what do we look at instead?
GetACP() returns the "ANSI Codepage", which I *think* is what we're looking
for. We should be able to compare that to something?
> Another issue: is it possible to set, say, LC_MESSAGES and LC_TIME to
> different codepages and if so what happens? If that does enable
> different bits of infrastructure to return incompatibly encoded strings,
> seems we need a defense against that --- what should it be?
AFAIK, yes, and then you get it back in the wrong encoding.
But as long as we set them all to the same, we should be safe. And AFAIK,
UTF8 (and UTF7, but we don't support that) is the only one we need to
handle specially.
> One bright spot is that this does seem to suggest a way to implement the
> recommendation I made in the -patches thread: if we can't support the
> encoding (codepage) used by the locale seen by initdb, we could try
> stripping the codepage indicator (if any) and plastering on .65001
> to get a UTF8-compatible locale name. That'd only work on Windows
> but that seems the platform where we're most likely to see unsupportable
> default encodings.
Um, yes, that should work - assuming encoding is set to UTF8. We can't do
that for any other encoding, of course.
> Comments? I don't have a Windows development environment so I'm not
> in a position to take the lead on testing/fixing this sort of stuff.
I have the Windows dev environment, but I feel like I'm in deep water
whenever I talk about locale/encoding stuff; I don't know it as well as
I'd like to. But I'm happy to do coding and testing if I can get enough
pointers on what I need to test :)