Sorry for top posting, but this thread is getting really long now and
difficult to follow.

Good point, Alexander, and noted, but I still feel the issue goes well
beyond whether UTF-8 encoded data is in the script or not, since my
examples (like the ones you provided when you did the original Unicode
patch for DBD::ODBC) use \x{xxyy}, in which case "use encoding" does not
come into it. The problem here is that DBD::ODBC, unixODBC and FreeTDS
are all providing some sort of translation of the data, and it is
difficult to see where it is going wrong. Take an SQLPrepare call: since
FreeTDS does not have the wide APIs, DBD::ODBC takes UTF-8 encoded data,
translates it to UTF-16 and passes it to unixODBC's SQLPrepareW, which
spots that FreeTDS has no wide APIs, translates it back to UTF-8 and
passes it to SQLPrepare/SQLPrepareA; FreeTDS should then translate it to
UCS-2 before passing it to MS SQL Server. Also, part of my example
involves bound data, and unixODBC does not touch bound data (so Nick of
unixODBC says). I suspect that is where the problem is: since unixODBC
does not touch bound data, it is still UTF-16 when it reaches FreeTDS,
so what do you set your client charset to in FreeTDS (given it is not
expecting "wide" Unicode data)?
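
For what it's worth, here is a minimal sketch of that path as I read it
(the DSN, table and euro character are placeholders I made up, not code
from this thread):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Hypothetical DSN routed through unixODBC + FreeTDS to MS SQL Server.
my $dbh = DBI->connect('dbi:ODBC:mssql', 'user', 'password',
                       { RaiseError => 1 });

# This literal carries Perl's utf8 flag, so a unicode-built DBD::ODBC
# converts it to UTF-16 and calls SQLPrepareW; unixODBC spots that
# FreeTDS has no wide API, converts back to UTF-8 and calls SQLPrepare;
# FreeTDS should then convert to UCS-2 for the server.
my $sth = $dbh->prepare("select * from tbl where col = '\x{20ac}'");
$sth->execute;

# Bound data, by contrast, is converted to UTF-16 by DBD::ODBC and then
# passed through unixODBC untouched - the suspect step above.
$sth = $dbh->prepare('select * from tbl where col = ?');
$sth->execute("\x{20ac}");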

Some further comments below.

Alexander Foken wrote:
> On 30.09.2009 15:39, Martin Evans wrote:
>>>> Does your setup pass the DBD::ODBC tests?
>>>>       
>>> No, it does not:
>>>
>>>     t/40UnicodeRoundTrip.t 
> At least this test should pass without warnings and errors. If it
> doesn't, the following Unicode tests do not make sense at all.
> 
> 
>> You are entering a world of pain. 
> 
> Right. Unicode is too young in computer terms ... ;-)
> And the various encodings and Unicode versions don't make things easier.
> 
>> use encoding xxx
>>
>> This is used in Perl to say your script is encoded in xxx. Just because
>> you have and accept UTF-8 encoded data does not mean you need "use
>> encoding", but if your script is encoded in xxx you need "use encoding
>> xxx". For instance, the example Hendrik gave you includes Unicode
>> characters but does not need encoding. As a result, I cannot see how
>> adding "use encoding 'utf-8'" should make any difference to data
>> returned from SQL Server through DBD::ODBC.
>>   
> It can make a difference if you add "use encoding 'utf-8';" to a script
> that is really encoded as ISO-8859-1, or if you don't add it to a script
> encoded as UTF-8 *and* the script contains non-ASCII string literals. In
> both cases, you end up with strings where the encoding and the UTF-8
> flag do not match.

Agreed, it can, BUT DBD::ODBC uses the utf8 flag in Perl to determine
whether data passed to it is UTF-8 encoded and then attempts to
translate it to UTF-16. If the data is not valid UTF-8 this translation
should fail and DBD::ODBC should die - that is the way I converted your
original patch for UNIX.

> Example 1:
> 
> #!/usr/bin/perl -w
> use strict;
> use encoding "utf-8"; # but file is encoded as iso-8859-1
> ("ÄÖÜ" eq "\x{00C4}\x{00D6}\x{00DC}") or die "encoding mismatch";
> # ^-- literal german umlauts, upper case, encoded as iso-8859-1
> print "ok\n";
> 
> Output:
> 
> Malformed UTF-8 character (unexpected non-continuation byte 0xd6,
> immediately after start byte 0xc4) at test.pl line 4.
> Malformed UTF-8 character (unexpected non-continuation byte 0xdc,
> immediately after start byte 0xd6) at test.pl line 4.
> Malformed UTF-8 character (1 byte, need 2, after start byte 0xdc) at
> test.pl line 4.
> encoding mismatch at test.pl line 4.

Something like that should happen in DBD::ODBC because the conversion
functions (the equivalent of WideCharToMultiByte on Windows) are
instructed to error on invalid encoding.
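
As a rough Perl-level analogue of that check-and-convert step (an
illustration of the behaviour only, not the actual DBD::ODBC source,
which does this in C):

use Encode qw(encode FB_CROAK);

sub utf8_to_utf16_or_die {
    my ($str) = @_;
    # Only strings carrying Perl's utf8 flag are treated as Unicode.
    return $str unless utf8::is_utf8($str);
    # FB_CROAK makes encode() die on bad input, mirroring the
    # "error on invalid encoding" behaviour described above.
    return encode('UTF-16LE', $str, FB_CROAK);
}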


> Example 2:
> 
> #!/usr/bin/perl -w
> use strict;
> # no "use encoding "utf-8";", but file is encoded as UTF-8
> ("ÄÖÜ" eq "\x{00C4}\x{00D6}\x{00DC}") or die "encoding mismatch";
> # ^-- literal german umlauts, upper case, encoded as UTF-8
> print "ok\n";
> 
> Output:
> 
> encoding mismatch at test.pl line 4.
> 
> 
> Note that Example 2 does not give you any warnings, as ISO-8859-1 does
> not have any invalid byte sequences. Perl sees the left-hand side of eq
> as a string literal containing six(!) characters encoded as ISO-8859-1
> (those 6 bytes that encode ÄÖÜ in UTF-8), that literal has its UTF-8
> flag turned off. The right-hand side is a string literal containing
> three UTF-8 characters, internally stored as the same six bytes, but
> with the UTF-8 flag turned on. A string of six characters cannot be the
> same as a string of three characters, so the eq expression is false.
> 
> In Example 1, Perl sees three(!) bytes(!) in the string literal on the
> left-hand side of eq that do not represent a valid UTF-8 string, hence
> the three warnings. Still, the string has a length of three characters
> and has its UTF-8 flag set. The right-hand side is the same as in
> Example 2, but the binary junk is not equal to "ÄÖÜ", so again, the eq
> expression is false.
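
As a quick illustration of that 6-vs-3 character point (my example, not
from your tests):

my $bytes = "\xC3\x84\xC3\x96\xC3\x9C"; # the UTF-8 bytes of ÄÖÜ, utf8 flag off
my $chars = "\x{00C4}\x{00D6}\x{00DC}"; # the same umlauts as 3 characters
print length($bytes), "\n";                            # prints 6
print length($chars), "\n";                            # prints 3
print(($bytes eq $chars) ? "equal\n" : "not equal\n"); # not equal
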
> 
> t/40UnicodeRoundTrip.t is intentionally written using \x{0000} sequences
> instead of non-ASCII literals to prevent this special problem. And it
> has four paranoia tests (utf8::is_utf8(...) in the BEGIN block) to
> absolutely make sure the test data has the UTF-8 flag set or cleared as
> expected.
> 
> t/UChelp.pm has a dumpstr() function that dumps the unicode string in
> pure ASCII using \x00 or \x{0000} sequences, including length and UTF-8
> flag. It prevents the unwanted side effect of a UTF-8-capable terminal
> that displays bytes written by Perl as Unicode characters, even if they
> were meant to be non-Unicode.
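
dumpstr() sounds like exactly the right tool. For anyone following along
without the module to hand, a helper of that kind can be sketched roughly
like this (my illustration, not the actual t/UChelp.pm code):

sub dumpstr_sketch {
    my ($s) = @_;
    # Render every character as \xNN or \x{NNNN} so a UTF-8-capable
    # terminal cannot quietly display the bytes as Unicode characters.
    my $ascii = join '', map {
        my $o = ord $_;
        $o > 0xFF ? sprintf('\x{%04X}', $o) : sprintf('\x%02X', $o);
    } split //, $s;
    return sprintf('"%s" (length=%d, utf8=%d)',
                   $ascii, length $s, utf8::is_utf8($s) ? 1 : 0);
}
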
> 
> 
> Alexander
> 
> 

Thanks for the input - keep watching, and good luck with your job search
(assuming your PerlMonks home node is still up to date - Friar already
;-)).

Martin
