Re: [HACKERS] UTF-8 encoding problem w/ libpq

Heikki Linnakangas Mon, 03 Jun 2013 11:43:25 -0700

On 03.06.2013 21:28, Tom Lane wrote:

Heikki Linnakangas<hlinnakan...@vmware.com>  writes:

He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the
backend is supposed to leave bytes with the high-bit set alone, ie. in
UTF-8 encoding, it's supposed to leave ä and ß alone.


Well, actually, downcase_truncate_identifier() is doing this:

                unsigned char ch = (unsigned char) ident[i];

                if (ch >= 'A' && ch <= 'Z')
                        ch += 'a' - 'A';
                else if (IS_HIGHBIT_SET(ch) && isupper(ch))
                        ch = tolower(ch);

There's basically no way that that second case can give pleasant results
in a multibyte encoding, other than by not doing anything.
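
To illustrate why the second branch cannot work on multibyte data, here is a minimal sketch. It assumes a single-byte LC_CTYPE such as Latin-1 or WIN1252; latin1_tolower() and bytewise_downcase() are hypothetical names, and the table is a simplification covering only bytes 0xC0..0xDE rather than what any particular libc does. In UTF-8, ä is the byte pair 0xC3 0xA4. Read as a single-byte character, 0xC3 is an uppercase letter (Ã), so it gets "lowered" to 0xE3 and the sequence is no longer the same character.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical, simplified stand-in for tolower() under a Latin-1-like
 * LC_CTYPE: bytes 0xC0..0xDE (except 0xD7, the multiplication sign) are
 * treated as uppercase letters whose lowercase form is 0x20 higher.
 */
static unsigned char
latin1_tolower(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        return (unsigned char) (ch + ('a' - 'A'));
    if (ch >= 0xC0 && ch <= 0xDE && ch != 0xD7)
        return (unsigned char) (ch + 0x20);
    return ch;
}

/* Downcase a buffer one byte at a time, ignoring character boundaries. */
static void
bytewise_downcase(unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        buf[i] = latin1_tolower(buf[i]);
}
```

Applied to the two bytes of ä (0xC3 0xA4), this yields 0xE3 0xA4: the lead byte has changed, so the identifier no longer contains the character the client sent.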


Hmph, I see.

I suspect
that Windows' libc has fewer defenses than other implementations and
performs some transformation that we don't get elsewhere.  This may also
explain the gripe yesterday in -general about funny results in OS X.

Can't really blame Windows on that. On Windows, we don't require that
the encoding and LC_CTYPE's charset match. The OP used UTF-8 encoding in
the server, but LC_CTYPE="English_United Kingdom.1252", ie. LC_CTYPE
implies WIN1252 encoding. We allow that and it generally works on
Windows because in varstr_cmp, we use MultiByteToWideChar() followed by
wcscoll_l(), which doesn't care about the charset implied by LC_CTYPE.
But for isupper(), it matters.

We talked about this before and went off into the weeds about whether
it was sensible to try to use towlower() and whether that wouldn't
create undesirably platform-sensitive results.  I wonder though if we
couldn't just fix this code to not do anything to high-bit-set bytes
in multibyte encodings.

Yeah, we should do that. It makes no sense to call isupper or tolower on
bytes belonging to multi-byte characters.
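
A rough sketch of what that could look like, not a patch. The function name downcase_ident_sketch() and the multibyte_encoding flag are made up for illustration; in the backend the flag would presumably be derived from the database encoding. ASCII letters are still folded, and high-bit-set bytes only go through isupper()/tolower() when the encoding is single-byte.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Downcase an identifier in place. multibyte_encoding is a hypothetical
 * flag supplied by the caller; when it is true, bytes with the high bit
 * set are left untouched because they are parts of multibyte characters.
 */
static void
downcase_ident_sketch(char *ident, size_t len, bool multibyte_encoding)
{
    assert(ident != NULL);
    for (size_t i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        else if (!multibyte_encoding && (ch & 0x80) != 0 && isupper(ch))
            ch = (unsigned char) tolower(ch);
        ident[i] = (char) ch;
    }
}
```

With multibyte_encoding set, an identifier such as "Tä" in UTF-8 (bytes 'T', 0xC3, 0xA4) comes out as 't', 0xC3, 0xA4 regardless of what LC_CTYPE says about byte 0xC3.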


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
