On 08/11/2011 17:53, David E. Wheeler wrote:
On Nov 8, 2011, at 5:16 AM, Tim Bunce wrote:
1. Focus initially on categorising the capabilities of the databases.
Specifically separating those that understand character encodings
at one or more of column, table, schema, database level.
Answer the questions:
what "Unicode support" is this database capable of? [vague]
are particular column data types or attributes needed?
does the db have a session/connection encoding concept?
does the db support binary data types?
does the client api identify data encoding?
A table summarizing this kind of info would be of great value.
I think this is the most important kind of data we need to move
forward with this topic. I suspect we'll end up with a few clear
levels of "unicode support" by databases that we can then focus on
more clearly.
+1. Yes, this should make things pretty clear.
2. Try to make a data-driven common test script.
It should fetch the length of the stored value, something like:
CREATE TABLE t (c VARCHAR(10));
INSERT INTO t VALUES (?)  <= $sth->execute("\x{263A}")  # smiley
SELECT LENGTH(c), c FROM t
Fetching the LENGTH is important because it tells us if the DB is
treating the value as Unicode. The description of DBD::Unify, for
example, doesn't clarify whether the db itself regards the stored value
as Unicode characters or as the underlying string of encoded bytes.
Also, it's probably best to avoid Latin characters for this; I'd use
something that always has a multi-byte encoding, like a smiley face char,
and something that doesn't have a variant using combining characters, so
that the length should be consistent if it's treated as Unicode. (A fuller
sketch of such a test follows below.)
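For concreteness, a minimal sketch of such a test against DBD::SQLite might
look like this (the DSN, the sqlite_unicode attribute, and the LENGTH
function name are assumptions that would need adjusting per driver):

use strict;
use warnings;
use DBI;

# Connect; sqlite_unicode makes DBD::SQLite return decoded Unicode strings.
my $dbh = DBI->connect('dbi:SQLite:dbname=unitest.db', '', '',
    { RaiseError => 1, sqlite_unicode => 1 });

my $smiley = "\x{263A}";   # WHITE SMILING FACE, multi-byte in UTF-8

$dbh->do('CREATE TABLE t (c VARCHAR(10))');
my $sth = $dbh->prepare('INSERT INTO t VALUES (?)');
$sth->execute($smiley);

# If the db treats the value as Unicode, LENGTH should report 1 character;
# if it sees raw UTF-8 bytes it will typically report 3.
my ($len, $val) = $dbh->selectrow_array('SELECT LENGTH(c), c FROM t');
printf "LENGTH(c)=%s, round-trip ok=%d\n", $len, ($val eq $smiley ? 1 : 0);

$dbh->disconnect;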
3. Focus on placeholders initially.
We can ponder utf8 in literal SQL later. That's a separate ball of mud.
(I'd also ignore unicode table/column/db names. It's a much lower
priority and may become clearer when other issues get resolved.)
+1, though good to know about. Just as important as placeholders, however, is
fetching data.
4. Tests could report local LANG / LC_ALL env var value
so when others report their results we'll have that context.
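Something like this near the top of the test output would capture that
context (just a sketch):

# Record the locale-related environment so results can be compared in context.
printf "LANG=%s LC_ALL=%s\n",
    $ENV{LANG}   // '(unset)',
    $ENV{LC_ALL} // '(unset)';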
Thanks again. I've only given it a quick skim. I'll read it again before LPW.
Meanwhile, it would be great if people could contribute the info for #1.
Tim.
p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
would make the tests shorter.
use DBI qw(data_diff);   # data_diff is exportable from DBI

my $sample_string = "\x{263A}";
...
print data_diff($sample_string, $returned_string);
Can this be turned into a complete script we can all just run?
Thanks,
David
I've just checked in unicode_test.pl to DBI's subversion trunk, in the /ex dir.
It won't run as-is: you'll need to change the do_connect sub to specify
how to connect to your DB.
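For example, an edited do_connect might look roughly like this (the DSN,
user, and password here are pure placeholders; use whatever your DBD needs):

sub do_connect {
    # Adjust the DSN and credentials for the database under test.
    return DBI->connect('dbi:Oracle:host=dbhost;sid=ORCL', 'someuser', 'somepass',
        { RaiseError => 1, PrintError => 0 });
}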
Also, there is a DBD-specific section at the start where you might have
to add a DBD it does not know about (anything other than DBD::ODBC,
DBD::Oracle, DBD::SQLite, DBD::CSV, or DBD::mysql) if it needs something
other than the defaults, e.g. the name of the length function in SQL, the
column types for unicode and binary columns, and the setting to enable
UTF8/Unicode support. It could be a bit of a pain if your DBD does not
support type_info_all, but I'm around on IRC and on this list if anyone
wants help making it work.
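To illustrate the kind of per-driver settings that section covers (this is
just a hypothetical layout, not the script's actual structure):

# Hypothetical per-driver defaults; the real unicode_test.pl may differ.
my %driver_defaults = (
    SQLite => {
        length_fn    => 'LENGTH',                 # SQL function returning string length
        unicode_type => 'VARCHAR(20)',            # column type for unicode data
        binary_type  => 'BLOB',                   # column type for binary data
        connect_attr => { sqlite_unicode => 1 },  # setting enabling UTF8/Unicode support
    },
    mysql => {
        length_fn    => 'CHAR_LENGTH',
        unicode_type => 'VARCHAR(20) CHARACTER SET utf8',
        binary_type  => 'VARBINARY(20)',
        connect_attr => { mysql_enable_utf8 => 1 },
    },
);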
It needs rather a lot of tidying up, so I'm not putting it forward as
code-of-the-year, but it is a start.
BTW, you'll need Test::More::UTF8 and perhaps a couple of other non-core
modules to run it.
Martin