On 24.11.2015 at 17:38, D'Arcy J.M. Cain wrote:
> It no longer segfaults. I am not sure that you needed to make such a
> drastic fix though. Did you consider casting to unsigned int? I
> suspect that the problem was chars > 127 being converted to negative
> numbers. The only negative number allowed to those macros is -1.
Yes, that's the root problem.
In a German locale with UTF-8, PostgreSQL outputs "34,25 \xe2\x82\xac",
where the last three bytes together are the Euro sign in UTF-8
encoding (yes, it needs three bytes since it came late to the party).
PyGreSQL goes through this string without awareness of the encoding and
checks all three bytes with isdigit(). As you said, '\xac' cast to
int becomes negative (-84), and for whatever strange reason isdigit()
considers it a digit (strange because the other two negative bytes are
not considered digits, and because \xac and \xffac are not considered
digits in Latin-1 or Unicode).
One solution is, as you say, to not cast to int, but to unsigned char,
which is what isdigit() expects. Another is to compile with
-funsigned-char, but we should not rely on that alone and should still
cast properly, since the flag may not be supported by all compilers
(it's probably a gcc thing only).
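For illustration, the cast could look roughly like the sketch below. The
helper name is_digit_safe is made up for this mail and is not taken from
the PyGreSQL sources; it only shows where the cast to unsigned char goes.

```c
#include <ctype.h>

/* Hypothetical helper, not part of PyGreSQL: passes the byte to
 * isdigit() as an unsigned char value (0..255), so a byte such as
 * '\xac' is never handed over as a negative int. */
static int is_digit_safe(char c)
{
    return isdigit((unsigned char)c) != 0;
}
```

With this, is_digit_safe('\xac') passes 172 instead of -84 to isdigit(),
which is within the range the function is defined for.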
However, I think my solution is better because calling isdigit() is
unnecessary overhead. Remember it's a function call, not a macro, and it
also takes the locale into account. So checking >= '0' && <= '9' is
faster; more importantly, we want to be as restrictive as possible and
not have other characters considered digits because of some strange
locale interpretation. For instance, '\xb2' would be considered a
digit on Windows because it is a superscript 2 in cp1252.
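A minimal sketch of such a range check, assuming an ASCII-compatible
encoding. The function names are hypothetical and chosen only for this
example; the actual code in the module may look different.

```c
#include <stddef.h>

/* Hypothetical helper: true only for the ASCII characters '0'..'9'.
 * No locale lookup is involved, and bytes above 127 never match,
 * whether plain char is signed or unsigned. */
static int is_ascii_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Counts the ASCII digits in a NUL-terminated byte string. */
static size_t count_ascii_digits(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (is_ascii_digit(*s))
            n++;
    }
    return n;
}
```

For the money string "34,25 \xe2\x82\xac" this counts the four digits
and skips the comma, the space and all three bytes of the Euro sign.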
You can still add -funsigned-char; it cannot hurt and should make
things a bit more deterministic.
-- Christoph