On 08/11/11 13:16, Tim Bunce wrote:
On Mon, Nov 07, 2011 at 01:37:38PM +0000, Martin J. Evans wrote:

I didn't think I was going to make LPW but it seems I will now - although it 
has cost me big time leaving it until the last minute.

All your beers at LPW are on me!

http://www.martin-evans.me.uk/node/121

Great work Martin. Many thanks.

I've some comments and suggestions for you...

It says "There is no single way across DBDs to enable Unicode support"
but doesn't define what "Unicode support" actually means. Clearly the
"Unicode support" of Oracle will be different to that of a CSV file.

Point taken.

So it seems that we need to be really clear about what we want.

I'd suggest...

1. Focus initially on categorising the capabilities of the databases.
     Specifically separating those that understand character encodings
     at one or more of column, table, schema, database level.
     Answer the questions:
         what "Unicode support" is this database capable of? [vague]
         are particular column data types or attributes needed?
         does the db have a session/connection encoding concept?
         does the db support binary data types?
         does the client api identify data encoding?
     A table summarizing this kind of info would be of great value.
     I think this is the most important kind of data we need to move
     forward with this topic.  I suspect we'll end up with a few clear
     levels of "unicode support" by databases that we can then focus on
     more clearly.


2. Try to make a data-driven common test script.

There is already one attached to the bottom of the post and referred to in the 
post - that probably was not very clear.

     It should fetch the length of the stored value, something like:
         CREATE TABLE t (c VARCHAR(10));
         INSERT INTO t VALUES (?)  <=  $sth->execute("\x{263A}")  # smiley
         SELECT LENGTH(c), c FROM t

It does that with a euro (\x{20ac}).

     Fetching the LENGTH is important because it tells us if the DB is
     treating the value as Unicode.

It doesn't do that, but it checks that what went in (\x{20ac}) is what comes 
out, which amounts to the same thing as checking the length, since what goes in 
is a single euro character. For a euro to come back out, at a minimum the DBD 
would have to decode the data returned from the database.
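
For reference, a minimal sketch of that round-trip/length check (assuming a 
connected $dbh and the table from the example above - the DSN here is purely 
illustrative, and the real attached script is more involved):

    use strict;
    use warnings;
    use utf8;
    use DBI;

    # hypothetical DSN - substitute whichever DBD is under test
    my $dbh = DBI->connect('dbi:ODBC:DSN=test', 'user', 'password',
                           { RaiseError => 1 });

    my $sample = "\x{20ac}";    # euro - always multi-byte in UTF-8

    $dbh->do(q/CREATE TABLE t (c VARCHAR(10))/);
    my $sth = $dbh->prepare(q/INSERT INTO t VALUES (?)/);
    $sth->execute($sample);

    # LENGTH tells us whether the db counts 1 character or 3 bytes;
    # the euro only compares equal if the DBD decodes it on the way out
    my ($len, $out) = $dbh->selectrow_array(q/SELECT LENGTH(c), c FROM t/);
    print "length reported by db: $len\n";
    print "round trip ok\n" if defined $out && $out eq $sample;
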

It also does this with table and column names and table_info/column_info.

     The description of DBD::Unify, for example, doesn't clarify if the db
     itself regards the stored value as unicode or as the underlying string
     of encoded bytes.
     Also probably best to avoid latin characters for this, I'd use
     something that always has a multi-byte encoding, like a smiley face char.

I mostly do that, but I had problems creating tables and columns with a euro in 
the name (which I can sort of understand, as there are presumably extra rules 
for valid table/column names, e.g. that they must start with an alphabetic 
character, though that character could still be unicode). The code that creates 
unicode table/column names uses \x{0100}, i.e.

fredĀ (LATIN CAPITAL LETTER A WITH MACRON)

which I think is ok.
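
For reference, the sort of thing the script does for names looks roughly like 
this (a sketch, assuming a connected $dbh as above; quote_identifier is the 
portable DBI way to quote names, and the table layout is only illustrative):

    use utf8;

    # unicode table name: "fred" plus LATIN CAPITAL LETTER A WITH MACRON
    my $table  = "fred\x{0100}";
    my $quoted = $dbh->quote_identifier($table);

    $dbh->do("CREATE TABLE $quoted (a INTEGER)");

    # check the name also round-trips through the metadata calls
    my $sth   = $dbh->table_info(undef, undef, $table, 'TABLE');
    my $found = $sth->fetchall_arrayref;
    print "table_info found it\n" if @$found;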

3. Focus on placeholders initially.
     We can ponder utf8 in literal SQL later. That's a separate ball of mud.
     (I'd also ignore unicode table/column/db names. It's a much lower
     priority and may become clearer when other issues get resolved.)

Yeah, I only added that because I knew it did not work in DBD::ODBC, where the 
interface is char * rather than a Perl scalar.

4. Tests could report local LANG / LC_ALL env var value
     so when others report their results we'll have that context.

Yes, forgot that - will add it to the test.
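
Something along these lines should be enough (a sketch; which variables are 
relevant may differ per platform):

    # report locale-related environment so results can be compared
    for my $var (qw(LANG LC_ALL LC_CTYPE)) {
        printf "%s=%s\n", $var, defined $ENV{$var} ? $ENV{$var} : '<unset>';
    }
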
Thanks again. I've only given it a quick skim. I'll read it again before LPW.

I will try and update it before then but a) it is my birthday tomorrow and b) 
I'm a bit under the weather at the moment.

Meanwhile, it would be great if people could contribute the info for #1.

I am happy to collect any such info and write it up so please at least cc me.
Tim.

p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
would make the tests shorter.
     my $sample_string = "\x{263A}";
     ...
     print data_diff($sample_string, $returned_string);

Yes, the output is a bit garbled because I was running the test code on Windows 
in a DOS terminal and I've never got unicode output working properly there. 
Data::Dumper produces better output but I forgot about data_diff - I will change 
to it where relevant.
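
For anyone reading along: data_diff comes from DBI's :utils export tag, so the 
change is roughly this (a sketch, with $dbh and the table t from earlier 
assumed):

    use DBI qw(:utils);     # exports data_diff, data_string_desc, neat, ...

    my $sample_string = "\x{20ac}";
    my ($returned_string) = $dbh->selectrow_array(q/SELECT c FROM t/);

    # data_diff returns '' when the strings match both logically and in
    # their physical encoding, otherwise a description of the difference
    print data_diff($sample_string, $returned_string);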

BTW, the example code in each DBD section was not really the test, it was just 
an example. My aim was to produce one script which runs against any DBD and 
that is attached to the end of the blog post. It was just too long to 
incorporate into the blog posting, whereas the short simple examples were 
better. Unfortunately the differences between DBDs (and some limitations and 
bugs) are making the script quite complex in places, e.g. one which caught me 
out was that you have to set LongReadLen in DBD::Oracle before calling prepare, 
but all other DBDs let you do it after prepare but before execute.
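
So the script now just does the cautious thing and sets it up front (a sketch; 
the attribute value and the query are only illustrative):

    # DBD::Oracle wants LongReadLen set before prepare; the other DBDs
    # accept it any time before execute, so set it as early as possible
    $dbh->{LongReadLen} = 1024 * 1024;   # max bytes of a long/lob to fetch
    $dbh->{LongTruncOk} = 0;             # error rather than silently truncate

    my $sth = $dbh->prepare(q/SELECT long_col FROM some_table/);  # illustrative
    $sth->execute;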

Martin
--
Martin J. Evans
Easysoft Limited
http://www.easysoft.com
