I've been learning much more than I wanted to know about $SUBJECT
since putting in the src/port/chklocale.c code to try to enforce
that our database encoding matches the system locale settings.
There's an ongoing thread in -patches that's been focused on getting
reasonable behavior from the point of view of the Far Eastern contingent:
http://archives.postgresql.org/pgsql-patches/2007-10/msg00031.php
(Some of that's been applied, but not the very latest proposals.)

Here's some more info from an off-list discussion with Dave Page:

------- Forwarded Messages

Date: Fri, 05 Oct 2007 20:54:04 +0100
From: Dave Page <[EMAIL PROTECTED]>
To: Tom Lane <[EMAIL PROTECTED]>
Subject: Re: [CORE] 8.3beta1 Available ...

Dave Page wrote:
> Some further info on that - utf-8 on Windows is actually a
> pseudo-codepage (65001) which doesn't have NLS files, hence why we have
> to convert to utf-16 before sorting. Perhaps the utf-8/65001 name
> difference is the problem here. I'll knock up a quick test program when
> the kids have gone to bed.

So, my test prog (below) returns the following:

[EMAIL PROTECTED]:~$ ./setlc "English_United Kingdom.65001"
LC_COLLATE=English_United Kingdom.65001;LC_CTYPE=C;LC_MONETARY=English_United Kingdom.65001;LC_NUMERIC=English_United Kingdom.65001;LC_TIME=English_United Kingdom.65001

So everything other than LC_CTYPE is acceptable in UTF-8 on Windows -
and we already handle LC_CTYPE for UTF-8 on Windows through our UTF-8 ->
UTF-16 conversions internally.

Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?

Regards, Dave.

#include <locale.h>
#include <stdio.h>

/* Set LC_ALL from argv[1] if given, then print the resulting locale string. */
int
main(int argc, char *argv[])
{
    char *lc;

    if (argc > 1)
        setlocale(LC_ALL, argv[1]);

    lc = setlocale(LC_ALL, NULL);
    printf("%s\n", lc ? lc : "(null)");
    return 0;
}

------- Message 2

Date: Fri, 05 Oct 2007 23:32:36 +0100
From: Dave Page <[EMAIL PROTECTED]>
To: Tom Lane <[EMAIL PROTECTED]>
Subject: Re: [CORE] 8.3beta1 Available ...

Tom Lane wrote:
> Dave Page <[EMAIL PROTECTED]> writes:
>> So, my test prog (below) returns the following:
>
>> [EMAIL PROTECTED]:~$ ./setlc "English_United Kingdom.65001"
>> LC_COLLATE=English_United
>> Kingdom.65001;LC_CTYPE=C;LC_MONETARY=English_United
>> Kingdom.65001;LC_NUMERIC=English_United
>> Kingdom.65001;LC_TIME=English_United Kingdom.65001
>
> That's just frickin' weird ... and a bit scary. There is a fair amount
> of code in PG that checks for lc_ctype_is_c and does things differently;
> one wonders if that isn't going to get misled by this behavior. (Hmm,
> maybe this explains some of the "upper/lower doesn't work" reports we've
> been getting??) Are you sure all variants of Windows act that way?

All the ones we support afaict.

>> Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?
>
> Is there something in Windows that constrains them to be all the same?
> If not this proposal seems just plain wrong :-( But in any case I'd
> feel more comfortable having it look at LC_COLLATE.

They can all be set independently - it's just that there's no UTF-7
(65000) or UTF-8 (65001) NLS files (http://shlimazl.nm.ru/eng/nls.htm)
defining them fully so Windows doesn't know any more than the characters
that are in both 'pseudo codepages'. As a result, you can't set LC_CTYPE
to .65001 because Windows knows it can't handle ToUpper() or ToLower()
etc. but you can use it to encode messages and other text.

/D

------- End of Forwarded Messages

I am thinking that Dave's discovery explains some previously unsolved
bug reports, such as
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php

If Windows returns LC_CTYPE=C in a situation like this, then the various
single-byte-charset optimization paths that are enabled by
lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
upper()/lower() and other places. ISTM we had better hack
lc_ctype_is_c() so that on Windows (only), if the database encoding is
UTF-8 then it returns FALSE regardless of what setlocale says.

That still leaves me with a boatload of questions, though. If we can't
trust LC_CTYPE as an indicator of the system charset, what can we trust?
In particular this seems to say that looking at LC_CTYPE for chklocale's
purposes is completely useless; what do we look at instead?

Another issue: is it possible to set, say, LC_MESSAGES and LC_TIME to
different codepages and if so what happens? If that does enable
different bits of infrastructure to return incompatibly encoded strings,
seems we need a defense against that --- what should it be?

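To make the lc_ctype_is_c() suggestion above concrete, here is a rough, untested-on-Windows sketch of the decision logic. Everything named sketch_* is a hypothetical stand-in invented for illustration, not the backend's actual code: the real function would presumably ask the backend for the database encoding and wrap the special case in #ifdef WIN32, rather than taking the encoding and an on_windows flag as parameters as is done here to keep the example standalone.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the backend's notion of database encoding. */
typedef enum
{
    SKETCH_SQL_ASCII,
    SKETCH_LATIN1,
    SKETCH_UTF8
} sketch_encoding;

/* True if the locale name reported by setlocale() is "C" or "POSIX". */
bool
sketch_locale_name_is_c(const char *name)
{
    return name != NULL &&
        (strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0);
}

/*
 * Sketch of the proposed rule.  ctype_name is what
 * setlocale(LC_CTYPE, NULL) returned; on_windows stands in for
 * an #ifdef WIN32 in real code.
 */
bool
sketch_lc_ctype_is_c(const char *ctype_name,
                     sketch_encoding db_encoding,
                     bool on_windows)
{
    /*
     * On Windows, LC_CTYPE may be reported as "C" merely because the
     * 65001 pseudo-codepage can't be used for LC_CTYPE.  Don't let
     * that enable the single-byte optimization paths for UTF-8.
     */
    if (on_windows && db_encoding == SKETCH_UTF8)
        return false;

    return sketch_locale_name_is_c(ctype_name);
}
```

With this rule, a UTF-8 database on Windows never takes the "C locale" shortcuts, while other platforms and other encodings behave as before.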
One bright spot is that this does seem to suggest a way to implement
the recommendation I made in the -patches thread: if we can't support
the encoding (codepage) used by the locale seen by initdb, we could try
stripping the codepage indicator (if any) and plastering on .65001 to
get a UTF8-compatible locale name. That'd only work on Windows but
that seems the platform where we're most likely to see unsupportable
default encodings.

Comments? I don't have a Windows development environment so I'm not in
a position to take the lead on testing/fixing this sort of stuff.

                        regards, tom lane
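
For illustration, the "strip the codepage indicator and plaster on .65001" idea might be sketched as below. The function name utf8_locale_name and its return convention are made up for this example, and the sketch assumes the codepage indicator is a purely numeric suffix after the last dot; symbolic suffixes such as .ACP or .OCP are not handled, and whether Windows accepts the resulting name would still have to be checked with setlocale().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build a UTF-8 variant of a Windows-style
 * locale name such as "English_United Kingdom.1252".
 *
 * A trailing ".<digits>" is treated as the codepage indicator and
 * dropped; ".65001" is then appended.  Returns 0 on success, or -1
 * if the result does not fit in buf.
 */
int
utf8_locale_name(const char *locale, char *buf, size_t buflen)
{
    const char *dot;
    size_t      baselen;
    int         n;

    assert(locale != NULL && buf != NULL);

    baselen = strlen(locale);

    /* Use the last dot, since country names may themselves contain dots. */
    dot = strrchr(locale, '.');
    if (dot != NULL && dot[1] != '\0' &&
        strspn(dot + 1, "0123456789") == strlen(dot + 1))
        baselen = (size_t) (dot - locale);

    n = snprintf(buf, buflen, "%.*s.65001", (int) baselen, locale);

    return (n >= 0 && (size_t) n < buflen) ? 0 : -1;
}
```

A caller would pass the locale name seen by initdb and then try the rewritten name with setlocale(); if that fails, it would have to fall back to whatever initdb does today.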