On Nov 8, 2011, at 5:16 AM, Tim Bunce wrote:
> 1. Focus initially on categorising the capabilities of the databases.
> Specifically separating those that understand character encodings
> at one or more of column, table, schema, database level.
> Answer the questions:
> what "Unicode support" is this database capable of? [vague]
> are particular column data types or attributes needed?
> does the db have a session/connection encoding concept?
> does the db support binary data types?
> does the client api identify data encoding?
> A table summarizing this kind of info would be of great value.
> I think this is the most important kind of data we need to move
> forward with this topic. I suspect we'll end up with a few clear
> levels of "unicode support" by databases that we can then focus on
> more clearly.
+1. Yes, this should make things pretty clear.
> 2. Try to make a data-driven common test script.
> It should fetch the length of the stored value, something like:
> CREATE TABLE t (c VARCHAR(10));
> INSERT INTO t VALUES (?) <= $sth->execute("\x{263A}") # smiley
> SELECT LENGTH(c), c FROM t
> Fetching the LENGTH is important because it tells us if the DB is
> treating the value as Unicode. The description of DBD::Unify, for
> example, doesn't clarify if the db itself regards the stored value
> as unicode or the underlying string of encoded bytes.
> Also probably best to avoid Latin characters for this. I'd use
> something that always has a multi-byte encoding, like a smiley face char.
And something that doesn't have a variant that uses combining characters, so
that the length should be consistent if it's treated as Unicode.
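
To illustrate both points, here is a minimal pure-Perl sketch (no database
involved; the variable names are only for illustration). It shows why the
smiley is a useful probe, since its character length and UTF-8 byte length
differ, and why a character with a combining-character variant would make
the reported length ambiguous:

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+263A WHITE SMILING FACE: one character, always multi-byte in UTF-8.
my $smiley = "\x{263A}";
my $chars  = length($smiley);                     # length in characters
my $bytes  = length(encode('UTF-8', $smiley));    # length in UTF-8 bytes

# "e with acute" has two Unicode spellings, so its length is ambiguous.
my $precomposed = "\x{E9}";      # single code point
my $decomposed  = "e\x{301}";    # "e" plus COMBINING ACUTE ACCENT

printf "smiley:  %d character, %d bytes\n", $chars, $bytes;
printf "e-acute: %d vs %d characters\n",
    length($precomposed), length($decomposed);
```

So if LENGTH(c) comes back as 1 for the smiley, the database is counting
characters; if it comes back as 3, it is most likely counting UTF-8 bytes.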
> 3. Focus on placeholders initially.
> We can ponder utf8 in literal SQL later. That's a separate ball of mud.
> (I'd also ignore unicode table/column/db names. It's a much lower
> priority and may become clearer when other issues get resolved.)
+1, though those are good to know about. Just as important as placeholders,
however, is fetching data.
> 4. Tests could report local LANG / LC_ALL env var value
> so when others report their results we'll have that context.
>
> Thanks again. I've only given it a quick skim. I'll read it again before LPW.
>
> Meanwhile, it would be great if people could contribute the info for #1.
>
> Tim.
>
> p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
> would make the tests shorter.
> my $sample_string = "\x{263A}";
> ...
> print data_diff($sample_string, $returned_string);
Can this be turned into a complete script we can all just run?
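
As a starting point, here is a rough sketch of what such a script might look
like. It is only an assumption about the shape of the thing, not a finished
test: the table name dbi_unicode_test is made up, the DSN is taken from the
standard DBI_DSN/DBI_USER/DBI_PASS environment variables with an in-memory
DBD::SQLite database as a fallback, and it assumes the database spells the
function LENGTH(). Any driver-specific Unicode attributes would still need
to be added to the connect call.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI qw(data_diff);

binmode STDOUT, ':encoding(UTF-8)';

# Assumption: connection details come from the environment; the fallback
# DSN requires DBD::SQLite to be installed.
my $dsn = $ENV{DBI_DSN} || 'dbi:SQLite:dbname=:memory:';
my $dbh = DBI->connect($dsn, $ENV{DBI_USER}, $ENV{DBI_PASS},
    { RaiseError => 1, PrintError => 0, AutoCommit => 1 });

my $sample_string = "\x{263A}";    # smiley, always multi-byte when encoded

# Report the context so that results from different people are comparable.
printf "Driver: %s\n", $dbh->{Driver}{Name};
for my $var (qw(LANG LC_ALL)) {
    printf "%s=%s\n", $var, defined $ENV{$var} ? $ENV{$var} : '(unset)';
}

# Hypothetical table name; the column definition follows Tim's example.
$dbh->do('CREATE TABLE dbi_unicode_test (c VARCHAR(10))');
$dbh->do('INSERT INTO dbi_unicode_test VALUES (?)', undef, $sample_string);

# Fetch both the database's idea of the length and the value itself.
my ($db_length, $returned_string) =
    $dbh->selectrow_array('SELECT LENGTH(c), c FROM dbi_unicode_test');

print "LENGTH(c) as seen by the database: $db_length\n";

# data_diff() returns an empty string when the two values are identical.
my $diff = data_diff($sample_string, $returned_string);
print $diff eq '' ? "Round trip OK: strings are identical\n" : $diff;

$dbh->do('DROP TABLE dbi_unicode_test');
$dbh->disconnect;
```

Running it with DBI_DSN set for each driver and posting the output would
give us the length, the round-trip diff, and the locale settings in one go.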
Thanks,
David