On 08/11/2011 17:53, David E. Wheeler wrote:
On Nov 8, 2011, at 5:16 AM, Tim Bunce wrote:

1. Focus initially on categorising the capabilities of the databases.
    Specifically separating those that understand character encodings
    at one or more of column, table, schema, database level.
    Answer the questions:
        what "Unicode support" is this database capable of? [vague]
        are particular column data types or attributes needed?
        does the db have a session/connection encoding concept?
        does the db support binary data types?
        does the client api identify data encoding?
    A table summarizing this kind of info would be of great value.
    I think this is the most important kind of data we need to move
    forward with this topic.  I suspect we'll end up with a few clear
    levels of "unicode support" by databases that we can then focus on
    more clearly.
+1. Yes, this should make things pretty clear.

2. Try to make a data-driven common test script.
    It should fetch the length of the stored value, something like:
        CREATE TABLE t (c VARCHAR(10));
        INSERT INTO t VALUES (?)  <=  $sth->execute("\x{263A}")  # smiley
        SELECT LENGTH(c), c FROM t
    Fetching the LENGTH is important because it tells us if the DB is
    treating the value as Unicode.  The description of DBD::Unify, for
    example, doesn't clarify if the db itself regards the stored value
    as unicode or the underlying string of encoded bytes.
    Also probably best to avoid Latin characters for this; I'd use
    something that always has a multi-byte encoding, like a smiley face char.
And something that doesn't have a variant that uses combining characters, so 
that the length should be consistent if it's treated as Unicode.
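Something like this, perhaps, as a starting point for #2 (an untested
sketch; the DSN, the VARCHAR/LENGTH spellings and any driver-specific
unicode attribute such as sqlite_unicode or mysql_enable_utf8 are
assumptions that will need to vary per DBD):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    use DBI;

    # Connection details are placeholders - fill in for your driver.
    my $dbh = DBI->connect('dbi:SQLite:dbname=unitest.db', '', '',
        { RaiseError => 1, PrintError => 0 });

    my $smiley = "\x{263A}";   # WHITE SMILING FACE, always multi-byte in UTF-8

    $dbh->do('CREATE TABLE t (c VARCHAR(10))');
    $dbh->do('INSERT INTO t VALUES (?)', undef, $smiley);

    my ($len, $c) = $dbh->selectrow_array('SELECT LENGTH(c), c FROM t');

    # A db that treats the value as Unicode should report LENGTH 1 (characters)
    # rather than 3 (the UTF-8 encoded byte length).
    print "db length:        $len\n";
    print "perl char length: ", length($c), "\n";
    print "round-trip ok:    ", ($c eq $smiley ? "yes" : "no"), "\n";

    $dbh->disconnect;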

3. Focus on placeholders initially.
    We can ponder utf8 in literal SQL later. That's a separate ball of mud.
    (I'd also ignore unicode table/column/db names. It's a much lower
    priority and may become clearer when other issues get resolved.)
+1, though good to know about. Just as important as placeholders, however, is 
fetching data.
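And for the fetch side, a couple of extra checks on the value that comes back
(again just a sketch, assuming the $dbh and table from the snippet above):

    my ($fetched) = $dbh->selectrow_array('SELECT c FROM t');

    # Did the driver hand back a decoded Perl string (UTF8 flag on) or raw
    # encoded bytes?  Not definitive on its own, but useful context.
    printf "SvUTF8 flag: %s\n", utf8::is_utf8($fetched) ? "on" : "off";
    printf "perl char length: %d\n", length($fetched);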

4. Tests could report local LANG / LC_ALL env var value
    so when others report their results we'll have that context.
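Something as simple as this at the top of the test would do (sketch):

    # Report locale-related environment for context alongside the results.
    printf "%s=%s\n", $_, defined $ENV{$_} ? $ENV{$_} : '(unset)'
        for qw(LANG LC_ALL LC_CTYPE);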

Thanks again. I've only given it a quick skim. I'll read it again before LPW.

Meanwhile, it would be great if people could contribute the info for #1.

Tim.

p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
would make the tests shorter.
    my $sample_string = "\x{263A}";
    ...
    print data_diff($sample_string, $returned_string);
Can this be turned into a complete script we can all just run?
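Something along these lines is what I'm imagining (untested; the DSN and the
sqlite_unicode attribute are just examples to make it runnable, and
data_diff() comes in via DBI's :utils import tag):

    use strict;
    use warnings;
    use utf8;
    use DBI qw(:utils);   # imports data_diff()

    # DSN and sqlite_unicode are examples only; substitute your own driver
    # and its unicode-enabling attribute.
    my $dbh = DBI->connect('dbi:SQLite:dbname=unitest.db', '', '',
        { RaiseError => 1, sqlite_unicode => 1 });

    my $sample_string = "\x{263A}";
    $dbh->do('CREATE TABLE t2 (c VARCHAR(10))');
    $dbh->do('INSERT INTO t2 VALUES (?)', undef, $sample_string);
    my ($returned_string) = $dbh->selectrow_array('SELECT c FROM t2');

    # data_diff() returns an empty string when the strings match in both
    # characters and underlying bytes, otherwise a description of the
    # logical/physical differences.
    print data_diff($sample_string, $returned_string) || "identical\n";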

Thanks,

David



I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir.

It won't run right now without changing the do_connect sub, as you have to specify how to connect to the DB. There is also a DBD-specific section at the start where you may have to add an entry for a DBD it does not know about (anything other than DBD::ODBC, DBD::Oracle, DBD::SQLite, DBD::CSV, DBD::mysql) if it needs something other than the defaults, e.g. the name of the length function in SQL, the column types for unicode and binary columns, and the setting to enable UTF8/Unicode support. It could be a bit of a pain if your DBD does not support type_info_all, but I'm around on IRC and on this list if anyone wants help making it work.
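To give a rough idea of the shape of that section (this is only a hypothetical
illustration of the kind of per-DBD settings involved, not the actual code in
unicode_test.pl):

    # Hypothetical illustration only - see unicode_test.pl for the real thing.
    my %dbd_defaults = (
        ODBC => {
            length_fn     => 'LEN',           # SQL length function name
            unicode_type  => 'NVARCHAR(20)',  # column type for unicode columns
            binary_type   => 'VARBINARY(20)', # column type for binary columns
            connect_attrs => {},              # attrs to enable UTF8/Unicode
        },
        SQLite => {
            length_fn     => 'LENGTH',
            unicode_type  => 'VARCHAR(20)',
            binary_type   => 'BLOB',
            connect_attrs => { sqlite_unicode => 1 },
        },
    );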

It needs rather a lot of tidying up, so I'm not putting it forward as code-of-the-year, but it is a start.

BTW, you'll need Test::More::UTF8 and perhaps a couple of other non-core modules to run it.

Martin
