Re: DBD::ODBC tricky unicode problem

Martin J. Evans Sun, 27 Oct 2013 05:21:34 -0700

I've snipped my original email and Tim's response and started againbased on Tim's comments in the email and on IRC. So


o this concentrates on input to the database and ignores output.

o only looks so far at inserting into varchars (and not nvarchars whichI currently believe works but we'll come back to that).o includes examples of different inputs, lengths in perl and what youget back.o test code and all discussion is windows only for now as ODBC driverson Unix can work in very different ways. Also, test code run to MS SQLServer which in my experience is one of the best ODBC drivers for Windows.o you cannot correctly translate chr data from a client to a serverusing the SQL Server ODBC driver if the client code page differs fromthe server code page - see http://support.microsoft.com/kb/234748.o MS quite clearly state "If your server must store data from multiplecode pages, the supported solution is to store the data in Unicodecolumns (NCHAR/NVARCHAR/NTEXT)." seehttp://support.microsoft.com/kb/234748 again.

o test code included at end of mail


A few things you should know:

1. D:O does not know what codepage you are in and there is nothing inODBC which allows it to find that out or even say it matters.2. I believe the TDS protocol (what MS SQL Server uses) allows a clientcode page to be sent, however I believe that is irrelevant now as theSQL Server driver converts your client code page chrs into unicodebefore sending them to SQL Server.3. You cannot expect to get back chrs you inserted into char/varcharcolumns if you did not turn on AutoTranslate in the driver (whichdefaults to on).4. D:O converts Perl unicode scalars to unicode on Windows usingMultiByteToWideChar and setting CP_UTF8.

ok, so diving into the example, here is hopefully fairly selfexplanatory output from the test code (code at end of email). I'veinterleaved my comments starting with #.


Current active console code page: 1252
DBD::ODBC build for unicode:1

Output connstr: DSN=asus2;UID=sa;PWD=easysoft;APP=Strawberry Perl(64-bit);WSID=ASUS2;Network=DBMSSOCN;Address=ASUS2\SQLEXPRESS,1433


Inserting a unicode euro, utf8 flag on:
  input string: €

data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3bytes

  ords of input string: 20ac,
  bytes of input string: e2,82,ac,
  database length: 3

data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7bytes

  ords of output string:e2,201a,ac,

# what happened here is SQLDecribeParam described the parameter asSQL_CHAR and that is what D:O bound it as. The driver converted 82 incode page 1252 to 201a (a magic quote) and the database sees this as 3 chrs# the probably mistake here is that D:O should have looked at the perldata, seen it was unicode and not bound it as SQL_CHAR (but SQL_WCHAR)even though SQLDescribeParam said it was


Inserting a UTF-8 encoded unicode euro, utf8 flag off:
"\x{0082}" does not map to cp1252 at test_code.pl line 36.
# the above resulted from a print to stdout in windows-1252
# you'd expect that since we were outputting bytes to the terminal
  input string: â\x{0082}¬

data_string_desc of input string: UTF8 off, non-ASCII, 3 characters 3bytes

  ords of input string: e2,82,ac,
  bytes of input string: e2,82,ac,
  database length: 3

data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7bytes

  ords of output string:e2,201a,ac,

# same thing as first case happened here except D:O couldn't have doneanything else as Perl said the scalar was not unicode and SQL Serversaid the parameter was SQL_CHAR


Inserting a unicode euro, utf8 flag on, forced SQL_WVARCHAR:
  input string: €

data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3bytes

  ords of input string: 20ac,
  bytes of input string: e2,82,ac,
  database length: 1

data_string_desc of output string: UTF8 on, non-ASCII, 1 characters 3bytes

  ords of output string:20ac,

# the above is arguably what should have happened in the first test casei.e., D:O should have bound as SQL_WVARCHAR because the perl data wasunicode. I included this one to show you what happens if you do it right.


Inserting a unicode U+187 which is not in the current code page:
"\x{0187}" does not map to cp1252 at test_code.pl line 36.
# you'd expect this warning here on printing to stdout
  input string: \x{0187}

data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 2bytes

  ords of input string: 187,
  bytes of input string: c6,87,
  database length: 2

data_string_desc of output string: UTF8 on, non-ASCII, 2 characters 5bytes

  ords of output string:c6,2021,

# here again, down to D:O binding as SQL_CHAR, SQL Server interpretedthis as 2 chrs in the client code page, and converted them to chr c6 anda double dagger

Inserting a unicode U+187 which is not in the current code page withforced binding:

"\x{0187}" does not map to cp1252 at test_code.pl line 36.
  input string: \x{0187}

data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 2bytes

  ords of input string: 187,
  bytes of input string: c6,87,
  database length: 1
  data_string_desc of output string: UTF8 off, ASCII, 1 characters 1 bytes
  ords of output string:3f,

# character U+187 won't fit in codepage 1252 so sql server converted itto a ?

So, I'm thinking the problem above is D:O ignores utf8 flag onparameters when they are bound and uses whatever SQLDescribeParam saysinstead (SQL_CHAR in case of varchar columns). If it didn't ignore theutf8 flag here, it would have to rebind on every execute (which it mayalready do, I didn't check).


test code:

=====
use 5.014;
use strict;
use warnings;
use DBI qw(:utils :sql_types);
use Encode qw(encode is_utf8);
use Win32::API;

binmode STDOUT, ":encoding(cp1252)";

sub show_it {
    my $h = shift;
    my $r = $h->selectrow_arrayref(q/select len(a), a from test/);
    say "  database length: ", $r->[0];
    say "  data_string_desc of output string: ", data_string_desc($r->[1]);
    print "  ords of output string:";
    foreach my $s(split(//, $r->[1])) {
        print sprintf("%x", ord($s)), ",";
    }
    print "\n";
    $h->do(q/delete from test/);
}

sub execute {
    my ($s, $string) = @_;

    my $bytes;
    if (is_utf8($string)) {
        $bytes = encode("UTF-8", $string);
    } else {
        $bytes = $string;
    }
    say "  input string: $string";
    say "  data_string_desc of input string: ", data_string_desc($string);
    print "  ords of input string: ";
    foreach my $s(split(//, $string)) {
        print sprintf("%x,", ord($s));
    }
    print "\n";

    print "  bytes of input string: ";
    foreach my $s(split(//, $bytes)) {
        print sprintf("%x,", ord($s));
    }
    print "\n";

    $s->execute($string);
}

sub set_codepage {
    # get active codepage and ensure it is cp1252

#http://stackoverflow.com/questions/1259084/what-encoding-code-page-is-cmd-exe-using

    Win32::API::More->Import("kernel32", "UINT GetConsoleOutputCP()");
    my $cp = GetConsoleOutputCP();
    print "Current active console code page: $cp\n";
    if ($cp != 1252) {
        print "Please change to codepage 1252 - run chcp 1252\n";

die "Incompatible active codepage - please change to codepage1252 by running chcp 1252\n";

    }
}

set_codepage();

my $h = DBI->connect();
say "DBD::ODBC build for unicode:", $h->{odbc_has_unicode};
say "Output connstr: ", $h->{odbc_out_connect_string};

my $s;
my $sql = q/insert into test (a) values(?)/;

eval {$h->do(q/drop table test/)};
$h->do(q/create table test (a varchar(100) collate Latin1_General_CI_AS)/);

# a simple unicode string
my $euro = "\x{20ac}";
say "Inserting a unicode euro, utf8 flag on:";
$s = $h->prepare($sql); # redo to ensure no sticky params
execute($s, $euro);
show_it($h);

# a simple unicode string first encoded in UTF-8
my $enc = encode("UTF-8", $euro);
say "Inserting a UTF-8 encoded unicode euro, utf8 flag off:";
$s = $h->prepare($sql); # redo to ensure no sticky params
execute($s, $enc);
show_it($h);

# a simple unicode string forced to be sent as SQL_WVARCHAR
say "Inserting a unicode euro, utf8 flag on, forced SQL_WVARCHAR:";
$s = $h->prepare($sql); # redo to ensure no sticky params
$s->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
execute($s, $euro);
show_it($h);

# a unicode string containing a character that is not in the column codepage
my $question = "\x{187}";
say "Inserting a unicode U+187 which is not in the current code page:";
$s = $h->prepare($sql); # redo to ensure no sticky params
execute($s, $question);
show_it($h);

# a unicode string containing a character that is not in the columncodepage but forced binding

my $question = "\x{187}";

say "Inserting a unicode U+187 which is not in the current code pagewith forced binding:";

$s = $h->prepare($sql); # redo to ensure no sticky params
$s->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
execute($s, $question);
show_it($h);

$h->disconnect;
=====

Martin
--
Martin J. Evans
Wetherby, UK

Re: DBD::ODBC tricky unicode problem

Reply via email to