On Mon, Nov 07, 2011 at 01:37:38PM +0000, Martin J. Evans wrote:
> >
> >I didn't think I was going to make LPW but it seems I will now - although it 
> >has cost me big time leaving it until the last minute.

All your beers at LPW are on me!

> http://www.martin-evans.me.uk/node/121

Great work Martin. Many thanks.

I've some comments and suggestions for you...

It says "There is no single way across DBDs to enable Unicode support"
but doesn't define what "Unicode support" actually means. Clearly the
"Unicode support" of Oracle will be different to that of a CSV file.
So it seems that we need to be really clear about what we want.

I'd suggest...

1. Focus initially on categorising the capabilities of the databases.
    Specifically, separate those that understand character encodings
    at one or more of the column, table, schema, and database levels.
    Answer the questions:
        what "Unicode support" is this database capable of? [vague]
        are particular column data types or attributes needed?
        does the db have a session/connection encoding concept?
        does the db support binary data types?
        does the client api identify data encoding?
    A table summarizing this kind of info would be of great value.
    I think this is the most important kind of data we need to move
    forward with this topic.  I suspect we'll end up with a few clear
    levels of "unicode support" by databases that we can then focus on
    more clearly.

2. Try to make a data-driven common test script.
    It should fetch the length of the stored value, something like:
        CREATE TABLE t (c VARCHAR(10));
        INSERT INTO t VALUES (?)   <=  $sth->execute("\x{263A}") # smiley
        SELECT LENGTH(c), c FROM t
    Fetching the LENGTH is important because it tells us if the DB is
    treating the value as Unicode.  The description of DBD::Unify, for
    example, doesn't clarify whether the db itself regards the stored
    value as unicode or just sees the underlying string of encoded bytes.
    Also, it's probably best to avoid latin characters for this; I'd use
    something that always has a multi-byte encoding, like a smiley face char.
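
    To make this concrete, here's a minimal sketch of the kind of
    script I have in mind (DBD::SQLite and its sqlite_unicode attribute
    are just an example; the DSN and the per-driver unicode settings
    would be the data-driven part):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        # example connection; each driver would supply its own
        # unicode-related connect attributes from the driving data
        my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "", {
            RaiseError     => 1,
            sqlite_unicode => 1,
        });

        my $smiley = "\x{263A}"; # WHITE SMILING FACE, 3 bytes in UTF-8

        $dbh->do("CREATE TABLE t (c VARCHAR(10))");
        $dbh->do("INSERT INTO t VALUES (?)", undef, $smiley);

        my ($db_len, $c) = $dbh->selectrow_array("SELECT LENGTH(c), c FROM t");

        # a db treating the value as Unicode should report LENGTH(c) == 1;
        # one that sees raw encoded bytes will likely report 3
        printf "db length=%d perl length=%d roundtrip=%s\n",
            $db_len, length($c), $c eq $smiley ? "ok" : "MISMATCH";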

3. Focus on placeholders initially.
    We can ponder utf8 in literal SQL later. That's a separate ball of mud.
    (I'd also ignore unicode table/column/db names. It's a much lower
    priority and may become clearer when other issues get resolved.)
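
    The distinction I mean, sketched (the quote() route being the mud):

        # bound via a placeholder: the driver is handed the perl string
        # itself and can encode it however the connection requires
        $dbh->do("INSERT INTO t VALUES (?)", undef, "\x{263A}");

        # interpolated into literal SQL: the value now also has to
        # survive the encoding of the whole statement string
        $dbh->do("INSERT INTO t VALUES (".$dbh->quote("\x{263A}").")");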

4. Tests could report the local LANG / LC_ALL env var values
    so when others report their results we'll have that context.
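
    Something trivial like this would do (I've added LC_CTYPE as it's
    the locale category that governs character handling):

        # report the locale context alongside the test output
        printf "# %s=%s\n", $_, defined $ENV{$_} ? $ENV{$_} : "(unset)"
            for qw(LANG LC_ALL LC_CTYPE);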

Thanks again. I've only given it a quick skim. I'll read it again before LPW.

Meanwhile, it would be great if people could contribute the info for #1.

Tim.

p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
would make the tests shorter.
    my $sample_string = "\x{263A}";
    ...
    print data_diff($sample_string, $returned_string);
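
A fuller sketch, reusing the $dbh and table from the script in #2:

    use DBI qw(data_diff);

    my $sample_string = "\x{263A}";
    $dbh->do("INSERT INTO t VALUES (?)", undef, $sample_string);
    my ($returned_string) = $dbh->selectrow_array("SELECT c FROM t");

    # data_diff() returns an empty string when the strings match in
    # both logical characters and physical bytes, otherwise a
    # multi-line description of each string and where they differ
    print data_diff($sample_string, $returned_string) || "match\n";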
