On 05/10/2011 00:06, Jonathan Leffler wrote:
On Tue, Oct 4, 2011 at 15:24, Martin J. Evans<martin.ev...@easysoft.com>wrote:

On 04/10/2011 22:38, Tim Bunce wrote:

I've not had time to devote to this thread. Sorry.

I'd be grateful if someone could post a summary of it if/when it
approaches some kind of consensus.

  I don't think there is a "kind of consensus" right now (although there has
been some useful discussion which will probably bear fruit) and I'd prefer to
work out what unicode support already exists and how it is implemented first.
For instance, Pg is very focussed on UTF-8 (as are most DBDs) and yet ODBC
uses UCS-2 under the hood and CSV can use anything you like. Greg/David/Postgres
seem to have an immediate problem with unicode support in Postgres and I can
imagine they are keen to resolve it; I'd suggest they do so now in the way
most appropriate for DBD::Pg. I don't see why this should necessarily affect
any discussion of what DBI should do/should say, as the DBDs which support
unicode already mostly do it in different ways.
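
To make that difference concrete, here is a tiny sketch (not part of the
attached test script) showing that the same Perl string comes out as quite
different bytes as UTF-8 versus UTF-16LE (which coincides with UCS-2 for a
BMP character like this):

    use strict;
    use warnings;
    use Encode qw(encode);

    my $euro = "\x{20ac}";   # EURO SIGN
    # hex dump of each encoding of the same one-character string
    printf "UTF-8:    %s\n", unpack "H*", encode("UTF-8", $euro);    # e282ac
    printf "UTF-16LE: %s\n", unpack "H*", encode("UTF-16LE", $euro); # ac20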

I've started gathering together details of what unicode support there is in
DBDs, how it is implemented and what special flags there are to support it.
However, this is a massive task. So far I've done ODBC, Oracle, CSV, Unify,
mysql, SQLite and Firebird, and have held off on Pg as I knew Greg was
working on it. Some might disagree, but DB2 is a major one I no longer have
access to (please contact me if you use DBD::DB2 and are prepared to spare
half an hour or so to modify the examples I have which verify unicode
support). Of course, if you use another DBD and can send me information on
its unicode support, I'd love to hear from you.

I thought the whole issue was an interesting topic and I had toyed with
doing a talk for LPW but, to be honest, it is already taking a lot of time
and I have personal issues right now (and of course my $work) which mean my
time is severely limited, so I'm doubtful I could have it ready in time as a
talk. I might just post what I have gathered in a week's time in the hope
that I get a little more input in the meantime.


DBD::Informix has had a couple of UTF-8 patches sent and one has been
applied to the code.  The other arrived this morning and has to be
compared.  The attribute names chosen are different, but both contain 'ix_'
as a prefix and UTF-8 in some form or another.
I couldn't find a repository for DBD::Informix so I couldn't see what these changes were.
I've added your comments to my document - to be released tonight, hopefully.

What I'm not sure about is how to test the code.  Creating an Informix
database that has UTF-8 data in it is trivial (well, nearly trivial).  The
difficulty is demonstrating where there were problems before and that the
problems are gone after.


If anyone has suggestions for how to show that UTF8 is working properly - in
the form of a fairly simple test case - I'd be very grateful to receive it
as guidance.
Attached to this post is some code I'm using (and see the minimal round-trip
sketch after these notes). A few notes, however:

o it tries to use type_info_all to find relevant column types, and so far DBD::SQLite is the only one tested which does not support type_info_all. If you've got type_info_all support, try to make it work: look for the find_type calls (there are 2 of them) and adjust them to add Informix SQL types. Alternatively, if you have not got type_info_all support, you'll need to replicate what I did for DBD::SQLite - see the start of the script.

The code attempts to work around the problem that some DBDs do not report all possible columns back from type_info_all.

o few DBDs seem to support column_info - if yours has not got it, that test is skipped

o you'll need an in.png PNG image, which should preferably be smaller than the maximum size of your blob column type. It does not matter what it is - any old small PNG will do.

o no drivers so far support unicode parameter markers - what a surprise

o some drivers have issues with blobs and unicode data in the same table - DBD::CSV notably

o some drivers cannot create unicode table names or tables with unicode column names e.g., DBD::mysql.

o no driver so far can do a table_info with a specified unicode (and existing) table name and return the table info. I suspect this is because the table_info and column_info calls are implemented in XS using char * instead of Perl scalars.

If you make this work for Informix please send it back to me.

Same applies to any other DBD maintainers.
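
As a starting point for a "fairly simple test case", here is about the
smallest round-trip check I can think of - a sketch only: the DSN is a
placeholder, a table t with a wide-enough character column c is assumed to
exist already, and you'd set whatever driver-specific unicode attribute your
DBD needs on connect:

    use strict;
    use warnings;
    use utf8;
    use Test::More tests => 2;
    use DBI;

    # placeholder DSN - adjust for your driver and environment
    my $h = DBI->connect('dbi:Informix:testdb', undef, undef,
                         { RaiseError => 1 });

    my $str = "caf\x{e9} \x{20ac}";   # 6 characters, more than 6 UTF-8 bytes

    $h->do('insert into t (c) values (?)', undef, $str);
    my ($out) = $h->selectrow_array(
        'select c from t where c = ?', undef, $str);

    is($out, $str, 'round trip preserves the string');
    is(length($out), length($str), 'length() counts characters, not bytes');

If the driver mishandles encodings you typically see either the second test
fail (bytes counted instead of characters) or no row come back from the
parameterised where clause.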
So, DBD::Informix is endeavouring to move forward, knowing that the
underlying database layers (ESQL/C on the client, and the Informix data
server) handle UTF-8 OK, so any problems are in persuading Perl (and perhaps
DBI) to handle it appropriately too.

[...Did I hear a chorus of "nice theory - shame about the practice"?...]

Martin

#
# Test unicode in a DBD - written for DBD::ODBC but should work for other
# DBDs if you change the column types at the start of this script.
# To run properly it needs an in.png PNG image file in the local working
# directory but it does not matter what it is - probably best to keep it small.
#
# NOTE: will attempt to create tables called fred and
#       fredĀ (LATIN CAPITAL LETTER A WITH MACRON)
#
# NOTE: there are multiple ways of doing named parameter markers in DBDs.
# some do:
#   insert into sometable (a_column) values(:fred);
#   bind_param(':fred', x);
# some do:
#   insert into sometable (a_column) values(:fred);
#   bind_param('fred', x);
# This script does the latter - see unicode_param_markers
#
# DBD::ODBC currently fails:
# not ok 3 - unicode table found by qualified table_info
# not ok 6 - unicode column found by qualified column_info
# not ok 18 - bind parameter with unicode parameter marker
# All of which is documented in the DBD::ODBC pod. The first 2 are because
# table_info/column_info XS code uses char * instead of Perl scalars and
# the latter is because DBD::ODBC parses the SQL looking for placeholders
# and it does this as bytes not UTF-8 encoded strings.
#
use DBI qw(:sql_types);
use strict;
use warnings;
use Data::Dumper;
use utf8;
use Test::More;
use Test::More::UTF8;           # set utf8 mode on failure,out and todo handles
use Test::Exception;
use List::Util qw(first);
use Encode;

my $euro = "\x{20ac}";
# This script tries to guess the types for unicode columns and binary columns
# - it may fail. If it does, set the types below or change the possible SQL
# types in the calls to find_type below. Also, your png file needs to
# be smaller than the max size a blob can handle.
my $unicode_column_type;        # = 'nvarchar' for MS SQL Server
my $blob_column_type;           # = 'image' for MS SQL Server
my $blob_bind_type;
my $in_png_file = 'in.png';

# you'll obviously have to change the following for other DBDs
my $h = DBI->connect("dbi:mysql:database=test", undef, undef, 
                         {RaiseError => 1 });
#my $h = DBI->connect('dbi:CSV:', undef, undef,
#                    {RaiseError => 1});
#my $h = DBI->connect("dbi:SQLite:dbname=test.db", '', '',
#                    {RaiseError => 1});
#my $h = DBI->connect("dbi:ODBC:DSN=asus2", undef, undef,
#                    {RaiseError => 1});

my $driver = $h->{Driver}->{Name};
note("Driver being used is $driver");

# the following sets the "magic" unicode/utf8 flag for each DBD
# and sets the column types for DBDs which do not support type_info_all
# (which is pretty much all of them)
if ($driver eq 'SQLite') {
    # does not support type_info_all
    $blob_column_type = 'blob';
    $blob_bind_type = SQL_BLOB;
    $unicode_column_type = 'varchar';
    $h->{sqlite_unicode} = 1;
} elsif ($driver eq 'CSV') {
    # does not support column_info
    #####$blob_column_type = 'blob';
    #####$blob_bind_type = SQL_BLOB;
    #####$unicode_column_type = 'varchar';
    $h->{f_encoding} = 'UTF8';
    $h->{f_ext} = '.csv';
} elsif ($driver eq 'mysql') {
    # does not support type_info_all
    $h->{mysql_enable_utf8} = 1;
    #####$blob_column_type = 'blob';
    #####$blob_bind_type = SQL_BLOB;
    #####$unicode_column_type = 'varchar';
} elsif ($driver eq 'ODBC') {
    # DBD::ODBC has type_info_all and column_info support
}
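
# One might add a branch for DBD::Pg here too; a hedged sketch only
# (untested, and pg_enable_utf8 behaviour was still under discussion in
# this very thread):
#
#     elsif ($driver eq 'Pg') {
#         $h->{pg_enable_utf8} = 1;  # assumption: decode fetched data as UTF-8
#     }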
# read the in.png file so we can see how big it is
open(my $ifh, "<:raw", $in_png_file)
    or die "Cannot open $in_png_file: $!";
my $png = do { local $/ = undef; <$ifh> };
close $ifh;

if (!defined($blob_column_type)) {
    ($blob_column_type, $blob_bind_type) =
        find_type($h, [SQL_LONGVARBINARY, SQL_BINARY, SQL_VARBINARY],
                  length($png));
}
BAIL_OUT("Could not find an image/blob type in type_info_all - you will " .
         "need to change this script to specify the type")
    if !defined($blob_column_type);
if (!defined($unicode_column_type)) {
    ($unicode_column_type) = find_type($h, [SQL_WVARCHAR, SQL_VARCHAR]);
}
BAIL_OUT("Could not find a unicode type in type_info_all - you will " .
         "need to change this script to specify the type")
    if !defined($unicode_column_type);


unicode_table($h);

unicode_column($h);

unicode_data($h);

mixed_lob_unicode_data($h);

unicode_param_markers($h);

done_testing;

sub drop_table {
    my ($h, $table) = @_;

    eval {
        local $h->{PrintError} = 0;
        my $s = $h->prepare(qq/drop table $table/);
        $s->execute;
    };
    #diag($@) if $@;
}

# create the named table with columns specified in $columns which is
# an arrayref with each element a hash of name and type
sub create_table {
    my ($h, $testmsg, $table, $columns) = @_;

    my $sql = qq/create table $table ( / .
        join(",", map {"$_->{name} $_->{type}"} @$columns) . ')';
    
    return lives_ok {
        my $s = $h->prepare($sql);
        $s->execute;
    } $testmsg;
}

sub unicode_table {
    my $h = shift;

    my $table = "fred\x{0100}";
    drop_table($h, $table);

    create_table($h, 'unicode table name supported', $table,
                 [{name => 'a', type => 'int'}]);
                 
    find_table($h, $table);

    drop_table($h, $table);
}

sub find_table {
    my ($h, $table) = @_;

    my $s = $h->table_info(undef, undef, undef, 'TABLE');
    my $r = $s->fetchall_arrayref;
    my $found = first {$_->[2] eq $table} @$r;
    ok($found, 'unicode table found in unqualified table_info');

    $s = $h->table_info(undef, undef, $table, 'TABLE');
    $r = $s->fetchall_arrayref;
    $found = first {$_->[2] eq $table} @$r;
    ok($found, 'unicode table found by qualified table_info');
}

sub find_column {
    my ($h, $table, $column) = @_;

    my $s = $h->column_info(undef, undef, $table, undef);
    if (!$s) {
        note("This driver does not seem to support column_info");
        note("Skipping this test");
        return;
    }
    my $r = $s->fetchall_arrayref;
    my $found = first {$_->[3] eq $column} @$r;
    ok($found, 'unicode column found in unqualified column_info');

    $s = $h->column_info(undef, undef, $table, $column);
    $r = $s->fetchall_arrayref;
    $found = first {$_->[3] eq $column} @$r;
    ok($found, 'unicode column found by qualified column_info');
}

sub unicode_column {
    my $h = shift;

    my $table = 'fred';
    my $column = "dave\x{0100}";

    drop_table($h, $table);

    create_table($h, 'unicode column name supported', $table,
                 [{name => $column, type => 'int'}]);

    find_column($h, $table, $column);

    drop_table($h, $table);
}

sub unicode_data {
    my $h = shift;

    my $table = 'fred';
    my $column = 'a';

    drop_table($h, $table);
    create_table($h, 'table for unicode data', $table,
                 [{name => $column, type => $unicode_column_type . "(20)"}]);

    lives_ok {
        my $s = $h->prepare(qq/insert into $table ($column) values (?)/);
        $s->execute($euro);
    } 'insert unicode data into table';

    my $s = $h->prepare(qq/select $column from $table/);
    $s->execute;
    my $r = $s->fetchall_arrayref;
    is($r->[0][0], $euro, 'unicode data out = unicode data in, no where');

    $s = $h->prepare(qq/select $column from $table where $column = ?/);
    $s->execute($euro);
    $r = $s->fetchall_arrayref;
    is(scalar(@$r), 1, 'select unicode data via parameterised where');

    $s = $h->prepare(qq/select $column from $table where $column = / .
                     $h->quote($euro));
    $s->execute;
    $r = $s->fetchall_arrayref;
    is(scalar(@$r), 1, 'select unicode data via inline where');

    drop_table($h, $table);
}

sub mixed_lob_unicode_data {
    my $h = shift;

    my $table = 'fred';
    my $column1 = 'a';
    my $column2 = 'b';

    drop_table($h, $table);
    create_table($h, 'table for unicode data', $table,
                 [{name => $column1, type => $unicode_column_type . "(20)"},
                  {name => $column2, type => $blob_column_type}]);

    open(my $ifh, "<:raw", $in_png_file)
        or die "Cannot open $in_png_file: $!";
    my $png = do { local $/ = undef; <$ifh> };
    close $ifh;

    lives_ok {
        my $s = $h->prepare(
            qq/insert into $table ($column1, $column2) values (?,?)/);
        $s->bind_param(1, $euro);
        $s->bind_param(2, $png, {TYPE => $blob_bind_type});
        #$s->execute($euro, $png);
        $s->execute;
    } 'insert unicode data and blob into table';

    my $s = $h->prepare(qq/select $column1, $column2 from $table/);
    $s->{LongReadLen} = length($png) * 2;
    $s->execute;
    my $r = $s->fetchall_arrayref;
    is($r->[0][0], $euro,
       'unicode data out = unicode data in, no where with blob');
    ok(!Encode::is_utf8($r->[0][1]), 'utf8 flag not set on blob data');
    ok($png eq $r->[0][1], 'retrieved blob = inserted blob');

    drop_table($h, $table);
}

sub unicode_param_markers {
    my $h = shift;

    my $table = 'fred';
    drop_table($h, $table);

    create_table($h, 'test table for unicode parameter markers', $table,
                 [{name => 'a', type => 'int'}]);

    my $param_marker = "fred\x{20ac}";
    lives_ok {
        my $s = $h->prepare(qq/insert into $table (a) values (:$param_marker)/);
        $s->bind_param($param_marker, 1);
        $s->execute;
    } 'bind parameter with unicode parameter marker';

    drop_table($h, $table);
}

# Search type_info_all for the first type matching one of the SQL types
# in @$types; if $minsize is given, the type's COLUMN_SIZE must also
# exceed it. Returns (type_name, sql_type) or an empty list.
sub find_type {
    my ($h, $types, $minsize) = @_;

    my $r = $h->type_info_all;
    BAIL_OUT("DBD does not seem to support type_info_all - you will " .
             "need to edit this script to specify column types")
        if !$r || (scalar(@$r) < 2);

    my $indexes = shift @$r;
    my $sql_type_idx = $indexes->{SQL_DATA_TYPE};
    my $type_name_idx = $indexes->{TYPE_NAME};
    my $column_size_idx = $indexes->{COLUMN_SIZE};

    if (!defined($sql_type_idx)) {
        note("type_info_all has no key for SQL_DATA_TYPE - falling back " .
             "on DATA_TYPE");
        $sql_type_idx = $indexes->{DATA_TYPE};
    }
    if (!defined($column_size_idx)) {
        note("type_info_all has no key for COLUMN_SIZE so not performing " .
             "size checks");
    }

    foreach (@$r) {
        note("Found type $_->[$sql_type_idx] ($_->[$type_name_idx]) size=" .
             (defined($column_size_idx) ? $_->[$column_size_idx] : 'undef'));
        foreach my $type (@$types) {
            next if $_->[$sql_type_idx] ne $type;
            if (!defined($minsize) || !defined($column_size_idx) ||
                $_->[$column_size_idx] > $minsize) {
                note("Found $type type which is $_->[$type_name_idx] " .
                     "and max size of " .
                     (defined($column_size_idx)
                          ? $_->[$column_size_idx] : 'undef'));
                return ($_->[$type_name_idx], $_->[$sql_type_idx]);
            }
            note("$type type ($_->[$type_name_idx]) but the max length " .
                 "of $_->[$column_size_idx] is less than the required " .
                 "length $minsize");
        }
    }
    return;
}
