On 03.06.2013 21:28, Tom Lane wrote:
Heikki Linnakangas<hlinnakan...@vmware.com>  writes:
He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the
backend is supposed to leave bytes with the high-bit set alone, ie. in
UTF-8 encoding, it's supposed to leave ä and ß alone.

Well, actually, downcase_truncate_identifier() is doing this:

                unsigned char ch = (unsigned char) ident[i];

                if (ch >= 'A' && ch <= 'Z')
                        ch += 'a' - 'A';
                else if (IS_HIGHBIT_SET(ch) && isupper(ch))
                        ch = tolower(ch);

There's basically no way that that second case can give pleasant results
in a multibyte encoding, other than by not doing anything.

Hmph, I see.

I suspect that Windows' libc has fewer defenses than other
implementations and
performs some transformation that we don't get elsewhere.  This may also
explain the gripe yesterday in -general about funny results in OS X.

Can't really blame Windows for that. On Windows, we don't require that the encoding and LC_CTYPE's charset match. The OP used UTF-8 encoding in the server, but LC_CTYPE="English_United Kingdom.1252", i.e. LC_CTYPE implies WIN1252 encoding. We allow that, and it generally works on Windows because in varstr_cmp we use MultiByteToWideChar() followed by wcscoll_l(), which doesn't care about the charset implied by LC_CTYPE. But for isupper(), it matters: it interprets each byte according to LC_CTYPE's charset, so the individual bytes of a multi-byte UTF-8 character get classified as if they were WIN1252 characters.

We talked about this before and went off into the weeds about whether
it was sensible to try to use towlower() and whether that wouldn't
create undesirably platform-sensitive results.  I wonder though if we
couldn't just fix this code to not do anything to high-bit-set bytes
in multibyte encodings.

Yeah, we should do that. It makes no sense to call isupper or tolower on bytes belonging to multi-byte characters.
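
To make the idea concrete, here is a rough sketch of what such a fix might look like. It is not the actual backend code: downcase_ident_sketch() and its enc_is_single_byte parameter are made-up names for illustration, and HIGHBIT_SET is a local stand-in for IS_HIGHBIT_SET. The assumption is that the caller already knows whether the database encoding is single-byte.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for the backend's IS_HIGHBIT_SET macro. */
#define HIGHBIT_SET(ch) ((unsigned char) (ch) & 0x80)

/*
 * Hypothetical helper (not the real downcase_truncate_identifier):
 * downcase an identifier in place.
 *
 * ASCII letters are always folded by arithmetic, independent of locale.
 * High-bit-set bytes are passed to isupper()/tolower() only when the
 * encoding is single-byte; in a multi-byte encoding such bytes are
 * fragments of a character and are left untouched.
 *
 * enc_is_single_byte is an assumed parameter supplied by the caller.
 */
static void
downcase_ident_sketch(char *ident, size_t len, bool enc_is_single_byte)
{
    for (size_t i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        else if (enc_is_single_byte && HIGHBIT_SET(ch) && isupper(ch))
            ch = (unsigned char) tolower(ch);

        ident[i] = (char) ch;
    }
}
```

With enc_is_single_byte set to false, a UTF-8 identifier such as "AB" followed by the two bytes of Ä (0xC3 0x84) has only its ASCII part folded; the two high-bit bytes come out unchanged regardless of what LC_CTYPE says.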

- Heikki

