On 30.09.2009 15:39, Martin Evans wrote:
Does your setup pass the DBD::ODBC tests?
No, it does not:
t/40UnicodeRoundTrip.t
At least this test should pass without warnings and errors. If it
doesn't, the following Unicode tests do not make sense at all.
You are entering a world of pain.
Right. Unicode is too young in computer terms ... ;-)
And the various encodings and Unicode versions don't make things easier.
use encoding xxx
This is used in Perl to say your script is encoded in xxx. Just because
you have and accept UTF-8 encoded data does not mean you need to "use
encoding" but if your script is encoded in xxx you need "use encoding
xxx". For instance, the example Hendrik gave you includes Unicode
characters but does not need encoding. As a result, I cannot see how
adding "use encoding 'utf-8'" should make any difference to data
returned from sql server through DBD::ODBC.
It can make a difference, if you add "use encoding 'utf-8';" to a script
that is really encoded as iso-8859-1 or if you don't add it to a script
encoded as UTF-8 *and* the script contains non-ASCII string literals. In
both cases, you end up with strings where encoding and UTF-8 flag do not match.
Example 1:
#!/usr/bin/perl -w
use strict;
use encoding "utf-8"; # but file is encoded as iso-8859-1
("ÄÖÜ" eq "\x{00C4}\x{00D6}\x{00DC}") or die "encoding mismatch";
# ^-- literal German umlauts, upper case, encoded as iso-8859-1
print "ok\n";
Output:
Malformed UTF-8 character (unexpected non-continuation byte 0xd6,
immediately after start byte 0xc4) at test.pl line 4.
Malformed UTF-8 character (unexpected non-continuation byte 0xdc,
immediately after start byte 0xd6) at test.pl line 4.
Malformed UTF-8 character (1 byte, need 2, after start byte 0xdc) at
test.pl line 4.
encoding mismatch at test.pl line 4.
Example 2:
#!/usr/bin/perl -w
use strict;
# no "use encoding "utf-8";", but file is encoded as UTF-8
("ÄÖÜ" eq "\x{00C4}\x{00D6}\x{00DC}") or die "encoding mismatch";
# ^-- literal German umlauts, upper case, encoded as UTF-8
print "ok\n";
Output:
encoding mismatch at test.pl line 4.
Note that Example 2 does not give you any warnings, as ISO-8859-1 does
not have any invalid byte sequences. Perl sees the left-hand side of eq
as a string literal containing six(!) characters encoded as ISO-8859-1
(those 6 bytes that encode ÄÖÜ in UTF-8), that literal has its UTF-8
flag turned off. The right-hand side is a string literal containing
three characters (U+00C4, U+00D6, U+00DC). Because all three code points
are below 0x100, Perl stores them as three bytes, also with the UTF-8
flag turned off. A string of six characters cannot be the same as a
string of three characters, so the eq expression is false.
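The six-versus-three comparison can be reproduced without depending on how
the script file itself is encoded. This is only a minimal sketch: the
variable names are mine, and it uses the Encode module instead of "use
encoding" to turn the bytes into characters.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The six bytes that encode "ÄÖÜ" in UTF-8, written as escapes so the
# result does not depend on the encoding of this file.
my $bytes = "\xC3\x84\xC3\x96\xC3\x9C";

# The three characters U+00C4, U+00D6, U+00DC.
my $chars = "\x{00C4}\x{00D6}\x{00DC}";

printf "bytes: length %d\n", length($bytes);    # 6
printf "chars: length %d\n", length($chars);    # 3
print  "undecoded: ", ($bytes eq $chars ? "equal" : "different"), "\n";

# Decoding turns the six bytes into three characters.
my $decoded = decode('UTF-8', $bytes);
print  "decoded:   ", ($decoded eq $chars ? "equal" : "different"), "\n";
```

The undecoded comparison reports "different" and the decoded one reports
"equal", which is the same mismatch as in Example 2.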
In Example 1, Perl sees three(!) bytes(!) in the string literal on the
left-hand side of eq that do not represent a valid UTF-8 string, hence
the three warnings. Still, the string has a length of three characters
and has its UTF-8 flag set. The right-hand side is the same as in
Example 2, but the binary junk is not equal to "ÄÖÜ", so again, the eq
expression is false.
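The reason for the warnings in Example 1 can be checked directly: the three
ISO-8859-1 bytes simply are not valid UTF-8. A rough sketch, assuming the
Encode module and strict decoding; the variable names are only for
illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "ÄÖÜ" as three ISO-8859-1 bytes, written as escapes.
my $latin1 = "\xC4\xD6\xDC";

# Strict decoding as UTF-8 dies on malformed input, so wrap it in eval.
# LEAVE_SRC keeps decode() from modifying $latin1.
my $as_utf8 = eval {
    decode('UTF-8', $latin1, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
print defined($as_utf8) ? "valid UTF-8\n" : "not valid UTF-8\n";

# Decoding with the encoding the bytes really have gives the expected string.
my $as_latin1 = decode('ISO-8859-1', $latin1);
print $as_latin1 eq "\x{00C4}\x{00D6}\x{00DC}" ? "ok\n" : "encoding mismatch\n";
```

This should print "not valid UTF-8" followed by "ok".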
t/40UnicodeRoundTrip.t is intentionally written using \x{0000} sequences
instead of non-ASCII literals to prevent this special problem. And it
has four paranoia tests (utf8::is_utf8(...) in the BEGIN block) to
absolutely make sure the test data has the UTF-8 flag set or cleared as
expected.
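Such a flag check could look roughly like the following. This is not the
code from t/40UnicodeRoundTrip.t, only a sketch of the idea with made-up
test strings.

```perl
use strict;
use warnings;

my $wide     = "\x{20AC}";    # code point above 0xFF: UTF-8 flag on
my $narrow   = "\x{00E4}";    # code point below 0x100: UTF-8 flag off
my $upgraded = $narrow;
utf8::upgrade($upgraded);     # same character, but UTF-8 flag on

die "expected UTF-8 flag on"     unless utf8::is_utf8($wide);
die "expected UTF-8 flag off"    if     utf8::is_utf8($narrow);
die "expected UTF-8 flag on"     unless utf8::is_utf8($upgraded);
die "upgrade changed the string" unless $upgraded eq $narrow;
print "test data as expected\n";
```

Writing the data with \x{...} escapes keeps the checks independent of the
encoding of the test file.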
t/UChelp.pm has a dumpstr() function that dumps the Unicode string in
pure ASCII using \x00 or \x{0000} sequences, including length and UTF-8
flag. It prevents the unwanted side effect of a UTF-8-capable terminal
that displays bytes written by Perl as Unicode characters, even if they
were meant to be non-Unicode.
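For illustration, a helper in the same spirit might look like this. It is
not the dumpstr() from t/UChelp.pm; the name dump_ascii and the output
format are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the dumpstr() from t/UChelp.pm.
# Returns length, UTF-8 flag and an ASCII-only rendering of the string.
sub dump_ascii {
    my ($s) = @_;
    return 'undef' unless defined $s;
    my $out = '';
    for my $ch (split //, $s) {
        my $cp = ord $ch;
        if ($cp > 0xFF) {
            $out .= sprintf '\x{%04X}', $cp;    # wide character
        }
        elsif ($cp < 0x20 || $cp > 0x7E || $ch eq '\\' || $ch eq '"') {
            $out .= sprintf '\x%02X', $cp;      # non-printable or non-ASCII
        }
        else {
            $out .= $ch;                        # printable ASCII as-is
        }
    }
    return sprintf 'len=%d utf8=%s "%s"',
        length($s), (utf8::is_utf8($s) ? 'on' : 'off'), $out;
}

print dump_ascii("\x{00C4}b"), "\n";    # len=2 utf8=off "\xC4b"
print dump_ascii("\x{20AC}"),  "\n";    # len=1 utf8=on "\x{20AC}"
print dump_ascii("\xC3\x84"),  "\n";    # len=2 utf8=off "\xC3\x84"
```

Because the output is pure ASCII, it looks the same on any terminal,
whatever encoding the terminal assumes.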
Alexander
--
Alexander Foken
mailto:alexan...@foken.de http://www.foken.de/alexander/