On Nov 8, 2011, at 5:16 AM, Tim Bunce wrote:

> 1. Focus initially on categorising the capabilities of the databases.
>    Specifically separating those that understand character encodings
>    at one or more of column, table, schema, database level.
>    Answer the questions:
>        what "Unicode support" is this database capable of? [vague]
>        are particular column data types or attributes needed?
>        does the db have a session/connection encoding concept?
>        does the db support binary data types.
>        does the client api identify data encoding?
>    A table summarizing this kind of info would be of great value.
>    I think this is the most important kind of data we need to move
>    forward with this topic. I suspect we'll end up with a few clear
>    levels of "unicode support" by databases that we can then focus on
>    more clearly.

+1. Yes, this should make things pretty clear.

> 2. Try to make a data-driven common test script.
>    It should fetch the length of the stored value, something like:
>        CREATE TABLE t (c VARCHAR(10));
>        INSERT INTO t VALUES (?)   <=  $sth->execute("\x{263A}") # smiley
>        SELECT LENGTH(c), c FROM t
>    Fetching the LENGTH is important because it tells us if the DB is
>    treating the value as Unicode. The description of DBD::Unify, for
>    example, doesn't clarify if the db itself regards the stored value
>    as unicode or the underlying string of encoded bytes.
>    Also probably best to avoid latin characters for this, I'd use
>    something that always has a multi-byte encoding, like a smiley face char.

And something that doesn't have a variant that uses combining characters, so that the length should be consistent if it's treated as Unicode.

> 3. Focus on placeholders initially.
>    We can ponder utf8 in literal SQL later. That's a separate ball of mud.
>    (I'd also ignore unicode table/column/db names. It's a much lower
>    priority and may become clearer when other issues get resolved.)

+1, though good to know about. Just as important as placeholders, however, is fetching data.

> 4. Tests could report local LANG / LC_ALL env var value
>    so when others report their results we'll have that context.
>
> Thanks again. I've only given it a quick skim. I'll read it again before LPW.
>
> Meanwhile, it would be great if people could contribute the info for #1.
>
> Tim.
>
> p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
> would make the tests shorter.
>     my $sample_string = "\x{263A}";
>     ...
>     print data_diff($sample_string, $returned_string);

Can this be turned into a complete script we can all just run?

Thanks,

David
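
A minimal sketch of what such a script might look like, pulling together the round trip from point 2, the placeholder from point 3, the environment report from point 4, and data_diff() from the p.s. Several things here are assumptions rather than anything settled in this thread: the DSN and credentials are taken from the DBI_DSN, DBI_USER and DBI_PASS environment variables, the table name dbi_unicode_test is made up, and the database is assumed to spell the function LENGTH() (some use CHAR_LENGTH() or LEN() instead).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI qw(:utils);    # :utils exports data_diff() and data_string_desc()

binmode STDOUT, ':encoding(UTF-8)';

# Point 4: report the locale context alongside the results.
for my $var (qw(LANG LC_ALL)) {
    printf "%s=%s\n", $var, $ENV{$var} // '(unset)';
}

# Assumption: DSN and credentials come from DBI_DSN, DBI_USER and DBI_PASS,
# which DBI consults when the connect() arguments are undef.
my $dbh = DBI->connect(undef, undef, undef,
    { RaiseError => 1, PrintError => 0 });
print "Driver: $dbh->{Driver}{Name}\n";

my $sample_string = "\x{263A}";          # WHITE SMILING FACE, 3 bytes in UTF-8
my $table         = 'dbi_unicode_test';  # hypothetical table name

$dbh->do("CREATE TABLE $table (c VARCHAR(10))");

my $ok = eval {
    # Point 3: send the value through a placeholder, not literal SQL.
    $dbh->do("INSERT INTO $table VALUES (?)", undef, $sample_string);

    # Assumption: this database calls the function LENGTH().
    my ($db_length, $returned_string) =
        $dbh->selectrow_array("SELECT LENGTH(c), c FROM $table");

    # 1 suggests the database counts characters; 3 suggests UTF-8 bytes.
    print "LENGTH(c) as seen by the database: $db_length\n";
    print "Returned value: ", data_string_desc($returned_string), "\n";

    # data_diff() returns an empty string when the two strings are identical.
    print data_diff($sample_string, $returned_string)
        || "Round trip OK: sent and returned strings are identical\n";
    1;
};
my $error = $@;

# Clean up even if the test itself failed, then report any failure.
$dbh->do("DROP TABLE $table");
$dbh->disconnect;
die $error unless $ok;
```

It could be run as something like `DBI_DSN=dbi:SQLite:dbname=test.db perl unicode_test.pl` (the file name is arbitrary), with driver-specific Unicode attributes added to the DSN or connect() call as each DBD requires.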