Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

Larry Rosenman Wed, 10 Feb 2016 14:40:39 -0800

On 2016-02-10 16:19, Tom Lane wrote:

I wrote:
Artur Zakirov <[email protected]> writes:
I think this is not a bug. It is normal behavior. In Mac OS, sscanf()
with the %s format reads the string one character at a time. The size of
letter 'х' is 2. And sscanf() separates it into two wrong characters.
That argument might be convincing if OSX behaved that way for all
multibyte characters, but it doesn't seem to be doing that.  Why is
only 'х' affected?
I looked into the OS X sources, and found that indeed you are right:
*scanf processes the input a byte at a time, and applies isspace() to
each byte separately, even when the locale is such that that's a clearly
insane thing to do.  Since this code was derived from FreeBSD, FreeBSD
has or once had the same issue.  (A look at the freebsd project on github
says it still does, assuming that's the authoritative repo.)  Not sure
about other BSDen.
I also verified that in UTF8-based locales, isspace() thinks that 0x85
and 0xA0, and no other high-bit-set values, are spaces.  Not sure exactly
why it thinks that, but that explains why 'х' fails when adjacent
codepoints don't.
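
To make the failure concrete: 'х' is U+0445, which UTF-8 encodes as the two bytes 0xD1 0x85, while its neighbour 'ф' (U+0444) is 0xD1 0x84. The sketch below is only a model of the behaviour described above, not the OS X code. modeled_isspace() and bytewise_token_len() are hypothetical names, and the set of "space" bytes is hard-coded from the observation in this thread rather than taken from the platform's isspace(). Unlike a real %s conversion, it does not skip leading whitespace.

```c
#include <stddef.h>

/* Hypothetical model of the reported behaviour: nonzero for the bytes
 * that isspace() was observed to accept in UTF8-based locales on OS X,
 * i.e. the ASCII whitespace characters plus 0x85 and 0xA0. */
static int
modeled_isspace(unsigned char b)
{
    return b == ' ' || (b >= '\t' && b <= '\r') || b == 0x85 || b == 0xA0;
}

/* Number of bytes a byte-at-a-time %s-style scan would accept from s
 * before stopping at a byte it considers whitespace, or at the end. */
static size_t
bytewise_token_len(const char *s)
{
    size_t n = 0;

    while (s[n] != '\0' && !modeled_isspace((unsigned char) s[n]))
        n++;
    return n;
}
```

Under this model, bytewise_token_len("\xd1\x85") is 1, because the scan stops in the middle of 'х' and leaves a lone 0xD1 lead byte, which is an invalid UTF-8 sequence. For "\xd1\x84" it is 2, so 'ф' survives intact.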

So apparently the coding rule we have to adopt is "don't use *scanf()
on data that might contain multibyte characters".  (There might be corner
cases where it'd work all right for conversion specifiers other than %s,
but probably you might as well just use strtol and friends in such cases.)
Ugh.
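
One way to follow that rule, as a minimal sketch: split tokens by hand and treat only the six ASCII whitespace characters as separators. Every byte of a multibyte UTF-8 sequence is >= 0x80, so it can never be mistaken for one of them. is_ascii_space() and next_token() are hypothetical helpers written for illustration, not existing PostgreSQL functions.

```c
#include <stddef.h>

/* True only for the six ASCII whitespace characters (space, \t, \n, \v,
 * \f, \r); never true for bytes >= 0x80, so multibyte UTF-8 sequences
 * are not split. */
static int
is_ascii_space(unsigned char b)
{
    return b == ' ' || (b >= '\t' && b <= '\r');
}

/* Hypothetical helper: copy the next whitespace-delimited token from
 * *cursor into buf (at most buflen - 1 bytes, always NUL-terminated)
 * and advance *cursor past it.  Returns 1 if a token was found, 0 if
 * only whitespace or the end of input remained. */
static int
next_token(const char **cursor, char *buf, size_t buflen)
{
    const char *p = *cursor;
    size_t      n = 0;

    if (buflen == 0)
        return 0;

    /* Skip leading ASCII whitespace, as %s would. */
    while (is_ascii_space((unsigned char) *p))
        p++;

    if (*p == '\0')
    {
        buf[0] = '\0';
        *cursor = p;
        return 0;
    }

    /* Copy bytes up to the next ASCII whitespace; extra bytes of an
     * overlong token are consumed but dropped. */
    while (*p != '\0' && !is_ascii_space((unsigned char) *p))
    {
        if (n + 1 < buflen)
            buf[n++] = *p;
        p++;
    }
    buf[n] = '\0';
    *cursor = p;
    return 1;
}
```

Called repeatedly on "\xd1\x85\xd1\x84 abc", this yields the four-byte token "хф" and then "abc". Note that truncating an overlong token can still cut a multibyte character in half, so a real caller would want a buffer sized to the input or an explicit length check.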

                        regards, tom lane

Definitive FreeBSD Sources:

https://svnweb.freebsd.org/base/


--
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 214-642-9640                 E-Mail: [email protected]
US Mail: 7011 W Parmer Ln, Apt 1115, Austin, TX 78729-6961


