On Nov 8, 2011, at 5:16 AM, Tim Bunce wrote:

> 1. Focus initially on categorising the capabilities of the databases.
>    Specifically separating those that understand character encodings
>    at one or more of column, table, schema, database level.
>    Answer the questions:
>        what "Unicode support" is this database capable of? [vague]
>        are particular column data types or attributes needed?
>        does the db have a session/connection encoding concept?
>        does the db support binary data types?
>        does the client api identify data encoding?
>    A table summarizing this kind of info would be of great value.
>    I think this is the most important kind of data we need to move
>    forward with this topic.  I suspect we'll end up with a few clear
>    levels of "unicode support" by databases that we can then focus on
>    more clearly.

+1. Yes, this should make things pretty clear.

> 2. Try to make a data-driven common test script.
>    It should fetch the length of the stored value, something like:
>        CREATE TABLE t (c VARCHAR(10));
>        INSERT INTO t VALUES (?)   <=  $sth->execute("\x{263A}") # smiley
>        SELECT LENGTH(c), c FROM t
>    Fetching the LENGTH is important because it tells us if the DB is
>    treating the value as Unicode.  The description of DBD::Unify, for
>    example, doesn't clarify if the db itself regards the stored value
>    as unicode or the underlying string of encoded bytes.
>    Also probably best to avoid Latin characters for this, I'd use
>    something that always has a multi-byte encoding, like a smiley face char.

And something that doesn't have a variant that uses combining characters, so 
that the length should be consistent if it's treated as Unicode.
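
To illustrate why that matters, here is a minimal sketch using only core modules (Encode and Unicode::Normalize). The helper name lengths() is made up for this example. U+263A has no decomposition, so its character count is the same under NFC and NFD, while U+00E9 grows to two code points under NFD:

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: report how long a string is under several views.
sub lengths {
    my ($s) = @_;
    return {
        chars      => length($s),                   # Perl characters
        utf8_bytes => length(encode('UTF-8', $s)),  # bytes once encoded as UTF-8
        nfc        => length(NFC($s)),              # composed form
        nfd        => length(NFD($s)),              # decomposed form
    };
}

# U+263A WHITE SMILING FACE vs. U+00E9 LATIN SMALL LETTER E WITH ACUTE
for my $s ("\x{263A}", "\x{E9}") {
    my $r = lengths($s);
    printf "U+%04X chars=%d utf8_bytes=%d nfc=%d nfd=%d\n",
        ord($s), @{$r}{qw(chars utf8_bytes nfc nfd)};
}
```

A database that treats the smiley as Unicode should report a length of 1. One that counts encoded bytes would report 3 if the value is stored as UTF-8.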

> 3. Focus on placeholders initially.
>    We can ponder utf8 in literal SQL later. That's a separate ball of mud.
>    (I'd also ignore unicode table/column/db names. It's a much lower
>    priority and may become clearer when other issues get resolved.)

+1, though good to know about. Just as important as placeholders, however, is 
fetching data.

> 4. Tests could report local LANG / LC_ALL env var value
>    so when others report their results we'll have that context.
> 
> Thanks again. I've only given it a quick skim. I'll read it again before LPW.
> 
> Meanwhile, it would be great if people could contribute the info for #1.
> 
> Tim.
> 
> p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
> would make the tests shorter.
>    my $sample_string = "\x{263A}";
>    ...
>    print data_diff($sample_string, $returned_string);

Can this be turned into a complete script we can all just run?
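
As a starting point, something along these lines might do. It is only a rough sketch: it assumes DBD::SQLite with an in-memory database as a stand-in driver (set DBI_DSN, DBI_USER and DBI_PASS to point it elsewhere), the sub name run_roundtrip() is invented here, and the CREATE TABLE syntax may need adjusting per database:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Insert one value via a placeholder, then fetch LENGTH(c) and the value back.
sub run_roundtrip {
    my ($dsn, $sample) = @_;
    my $dbh = DBI->connect($dsn, $ENV{DBI_USER}, $ENV{DBI_PASS},
        { RaiseError => 1, PrintError => 0 });
    $dbh->do('CREATE TABLE t (c VARCHAR(10))');
    $dbh->do('INSERT INTO t VALUES (?)', undef, $sample);
    my ($len, $returned) = $dbh->selectrow_array('SELECT LENGTH(c), c FROM t');
    $dbh->do('DROP TABLE t');
    $dbh->disconnect;
    return ($len, $returned);
}

unless (caller) {
    # Assumption: fall back to SQLite in memory when DBI_DSN is not set.
    my $dsn    = $ENV{DBI_DSN} || 'dbi:SQLite:dbname=:memory:';
    my $sample = "\x{263A}";    # smiley

    # Report the locale context so results from different people can be compared.
    printf "LANG=%s LC_ALL=%s\n",
        map { defined $ENV{$_} ? $ENV{$_} : '(unset)' } qw(LANG LC_ALL);

    my ($len, $returned) = run_roundtrip($dsn, $sample);
    print "DSN: $dsn\n";
    print "LENGTH(c) as seen by the database: $len\n";

    # data_diff() returns an empty string when the two values are identical.
    my $diff = DBI::data_diff($sample, $returned);
    print $diff eq '' ? "round trip: identical\n" : "round trip differs:\n$diff";
}
```

It deliberately sets no driver-specific Unicode options, so what it prints reflects each driver's default behaviour.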

Thanks,

David


