Re: Add Unicode Support to the DBI

2011-11-08 Thread Tim Bunce
On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote:
 
 I didn't think I was going to make LPW but it seems I will now - although it 
 has cost me big time leaving it until the last minute.

All your beers at LPW are on me!

 http://www.martin-evans.me.uk/node/121

Great work Martin. Many thanks.

I've some comments and suggestions for you...

It says "There is no single way across DBDs to enable Unicode support"
but doesn't define what "Unicode support" actually means. Clearly the
Unicode support of Oracle will be different to that of a CSV file.
So it seems that we need to be really clear about what we want.

I'd suggest...

1. Focus initially on categorising the capabilities of the databases.
Specifically separating those that understand character encodings
at one or more of column, table, schema, database level.
Answer the questions:
what Unicode support is this database capable of? [vague]
are particular column data types or attributes needed?
does the db have a session/connection encoding concept?
does the db support binary data types?
does the client api identify data encoding?
A table summarizing this kind of info would be of great value.
I think this is the most important kind of data we need to move
forward with this topic.  I suspect we'll end up with a few clear
levels of unicode support by databases that we can then focus on
more clearly.

2. Try to make a data-driven common test script.
It should fetch the length of the stored value, something like:
CREATE TABLE t (c VARCHAR(10));
INSERT INTO t VALUES (?)  <=  $sth->execute("\x{263A}") # smiley
SELECT LENGTH(c), c FROM t
Fetching the LENGTH is important because it tells us if the DB is
treating the value as Unicode.  The description of DBD::Unify, for
example, doesn't clarify if the db itself regards the stored value
as unicode or as the underlying string of encoded bytes.
Also probably best to avoid latin characters for this; I'd use
something that always has a multi-byte encoding, like a smiley face char.
(A sketch of such a probe follows this list.)

3. Focus on placeholders initially.
We can ponder utf8 in literal SQL later. That's a separate ball of mud.
(I'd also ignore unicode table/column/db names. It's a much lower
priority and may become clearer when other issues get resolved.)

4. Tests could report local LANG / LC_ALL env var value
so when others report their results we'll have that context.
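
A minimal sketch of the kind of driver-neutral probe item 2 describes.
The DSN handling, table name and the LENGTH() spelling are assumptions
for illustration; real drivers will need per-database tweaks:

    #!/usr/bin/env perl
    # Hypothetical probe: insert a smiley via a placeholder, then ask the
    # database how long it thinks the stored value is.
    use strict;
    use warnings;
    use utf8;
    use DBI;

    my $dbh = DBI->connect($ENV{DBI_DSN}, $ENV{DBI_USER}, $ENV{DBI_PASS},
                           { RaiseError => 1 });
    my $smiley = "\x{263A}";   # WHITE SMILING FACE - never a single byte

    $dbh->do("CREATE TABLE dbi_unicode_probe (c VARCHAR(10))");
    $dbh->do("INSERT INTO dbi_unicode_probe VALUES (?)", undef, $smiley);

    # LENGTH() is a guess; some engines spell it LEN() or CHAR_LENGTH().
    my ($len, $val) = $dbh->selectrow_array(
        "SELECT LENGTH(c), c FROM dbi_unicode_probe");

    print "stored length: $len (want 1)\n";
    print "round trip ok: ", ($val eq $smiley ? "yes" : "no"), "\n";

    $dbh->do("DROP TABLE dbi_unicode_probe");
    $dbh->disconnect;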

Thanks again. I've only given it a quick skim. I'll read it again before LPW.

Meanwhile, it would be great if people could contribute the info for #1.

Tim.

p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
would make the tests shorter.
my $sample_string = "\x{263A}";
...
print data_diff($sample_string, $returned_string);
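
For instance (a sketch only - $dbh and table t are assumed to exist,
and data_diff is imported via DBI's :utils tag):

    use DBI qw(:utils);
    use Test::More;

    my $sample_string     = "\x{263A}";
    my ($returned_string) = $dbh->selectrow_array("SELECT c FROM t");
    is($returned_string, $sample_string, "round trip")
        or diag(data_diff($sample_string, $returned_string));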


Re: Add Unicode Support to the DBI

2011-11-08 Thread H.Merijn Brand
On Tue, 8 Nov 2011 13:16:17 +, Tim Bunce tim.bu...@pobox.com
wrote:

 On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote:
  
   I didn't think I was going to make LPW but it seems I will now -
   although it has cost me big time leaving it until the last minute.  
 
 All your beers at LPW are on me!
 
  http://www.martin-evans.me.uk/node/121  
 
 Great work Martin. Many thanks.

I regret more and more that I cannot be there.
Please, beside doing useful stuff, have FUN!

-- 
H.Merijn Brand  http://tux.nl  Perl Monger  http://amsterdam.pm.org/
using 5.00307 through 5.14 and porting perl5.15.x on HP-UX 10.20, 11.00,
11.11, 11.23 and 11.31, OpenSuSE 10.1, 11.0 .. 11.4 and AIX 5.2 and 5.3.
http://mirrors.develooper.com/hpux/   http://www.test-smoke.org/
http://qa.perl.org  http://www.goldmark.org/jeff/stupid-disclaimers/


Re: Add Unicode Support to the DBI

2011-11-08 Thread Martin J. Evans

On 08/11/11 13:16, Tim Bunce wrote:

On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote:


I didn't think I was going to make LPW but it seems I will now - although it 
has cost me big time leaving it until the last minute.


All your beers at LPW are on me!


http://www.martin-evans.me.uk/node/121


Great work Martin. Many thanks.

I've some comments and suggestions for you...

It says "There is no single way across DBDs to enable Unicode support"
but doesn't define what "Unicode support" actually means. Clearly the
Unicode support of Oracle will be different to that of a CSV file.


Point taken.


So it seems that we need to be really clear about what we want.

I'd suggest...

1. Focus initially on categorising the capabilities of the databases.
 Specifically separating those that understand character encodings
 at one or more of column, table, schema, database level.
 Answer the questions:
 what Unicode support is this database capable of? [vague]
 are particular column data types or attributes needed?
 does the db have a session/connection encoding concept?
 does the db support binary data types.
 does the client api identify data encoding?
 A table summarizing this kind of info would be of great value.
 I think this is the most important kind of data we need to move
 forward with this topic.  I suspect we'll end up with a few clear
 levels of unicode support by databases that we can then focus on
 more clearly.




2. Try to make a data-driven common test script.


There is already one attached to the bottom of the post and referred to in the
post - that was probably not very clear.


 It should fetch the length of the stored value, something like:
 CREATE TABLE t (c VARCHAR(10));
 INSERT INTO t VALUES (?)  <=  $sth->execute("\x{263A}") # smiley
 SELECT LENGTH(c), c FROM t


It does that with a euro ("\x{20ac}").


 Fetching the LENGTH is important because it tells us if the DB is
 treating the value as Unicode.


It doesn't fetch the length, but it checks that what went in ("\x{20ac}") is what
comes out, which amounts to the same thing since what goes in is a euro character.
For a euro to come back out, the DBD would at a minimum have to decode the data
from the database.

It also does this with table and column names and table_info/column_info.


 The description of DBD::Unify, for
 example, doesn't clarify if the db itself regards the stored value
 as unicode or the underlying string of encoded bytes.
 Also probably best to avoid latin characters for this, I'd use
 something that always has a multi-byte encoding, like a smiley face char.


I mostly do that, but I had problems creating tables and columns with a euro in
the name (which I can sort of understand, as there are presumably extra rules for
valid table/column names - e.g. they must be alphabetic - even though they could
still be unicode). The code to create unicode table/column names uses \x{0100}, i.e.

fredĀ (LATIN CAPITAL LETTER A WITH MACRON)

which I think is ok.


3. Focus on placeholders initially.
 We can ponder utf8 in literal SQL later. That's a separate ball of mud.
 (I'd also ignore unicode table/column/db names. It's a much lower
 priority and may become clearer when other issues get resolved.)


Yeah, I only added that because I knew it did not work in DBD::ODBC, where the
interface is a char * rather than a Perl scalar.
 

4. Tests could report local LANG / LC_ALL env var value
 so when others report their results we'll have that context.


Yes, I forgot that - I will add it to the test.
 

Thanks again. I've only given it a quick skim. I'll read it again before LPW.


I will try and update it before then but a) it is my birthday tomorrow and b) 
I'm a bit under the weather at the moment.

 

Meanwhile, it would be great if people could contribute the info for #1.


I am happy to collect any such info and write it up so please at least cc me.
 

Tim.

p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
would make the tests shorter.
 my $sample_string = "\x{263A}";
 ...
 print data_diff($sample_string, $returned_string);


Yes, the output is a bit screwed because I was running the test code on Windows
in a DOS terminal and I've never got that working properly. Data::Dumper
produces better output but I forgot about data_diff - I will change it where relevant.

BTW, the example code in each DBD was not really the test, it was just an
example. My aim was to produce one script which runs against any DBD, and that is
attached to the end of the blog post. It's just that it was too long to incorporate
into a blog posting, whereas the short simple examples were better.
Unfortunately the differences between DBDs (and some limitations and bugs) are
making the script quite complex in places, e.g. one which caught me out was that
you have to set LongReadLen in DBD::Oracle before calling prepare but all other
DBDs let you do 

Re: Add Unicode Support to the DBI

2011-11-08 Thread Tim Bunce
On Tue, Nov 08, 2011 at 02:45:39PM +, Martin J. Evans wrote:
 On 08/11/11 13:16, Tim Bunce wrote:
 On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote:
 
 2. Try to make a data-driven common test script.
 
 There is already one attached to the bottom of the post and referred to in 
 the post - probably was not very clear.

Ah, it's in the middle, in the DBDs section.
Could you put it in github or the DBI repo?
I may find some time to hack on it.

  It should fetch the length of the stored value, something like:
  CREATE TABLE t (c VARCHAR(10));
  INSERT INTO t VALUES (?)  <=  $sth->execute("\x{263A}") # smiley
  SELECT LENGTH(c), c FROM t
 
 It does that with a euro \x{20ac}
 
  Fetching the LENGTH is important because it tells us if the DB is
  treating the value as Unicode.
 
 It doesn't do that but it checks what went in \x{20ac} is what comes
 out which is the same as checking the length since what goes in is a
 euro character. For a euro to come back out at the minimum the DBD
 would have to decode the data from the database.

It's not only important that what goes in comes back out again, but also
that what goes in is interpreted by the database in the right way.
Otherwise there are lots of more subtle issues like sorting bugs caused
by the database 'seeing' encoded bytes instead of unicode characters.
(Hence my point about the description for DBD::Unify being incomplete.)

It's _possible_ for a db to store a euro as a single byte. So it's
possible for the test to yield a false positive.  I prefer using a
smiley because it's not possible to store it in a single byte.
So if LENGTH(c) returns 1 we can be very confident that the db is
interpreting the data correctly.

 Thanks again. I've only given it a quick skim. I'll read it again before LPW.
 
 I will try and update it before then but a) it is my birthday tomorrow and b) 
 I'm a bit under the weather at the moment.

HAPPY BIRTHDAY!
Get well soon!

 Meanwhile, it would be great if people could contribute the info for #1.
 
 I am happy to collect any such info and write it up so please at least cc me.

 p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
 would make the tests shorter.
  my $sample_string = \x{263A};
  ...
  print data_diff($sample_string, $returned_string);
 
 Yes, the output is a bit screwed because I was running the test code
 on Windows in a dos terminal and I've never got that working properly.

Ouch.

 Data::Dumper produces better output but I forgot data_diff - I will
 change where relevant.
 
 BTW, the example code in each DBD was not really the test, it was just
 an example. My aim was to produce one script which ran to any DBD and
 that is attached to the end of the blog post. It's just it was too
 long to incorporate into a blog posting whereas the short simple
 examples were better.

Yeap. Good idea. I'd simply missed the link when I skimmed it.

Thanks again for driving this forward Martin.

Tim.


Re: Add Unicode Support to the DBI

2011-11-08 Thread David E. Wheeler
On Nov 8, 2011, at 5:16 AM, Tim Bunce wrote:

 1. Focus initially on categorising the capabilities of the databases.
Specifically separating those that understand character encodings
at one or more of column, table, schema, database level.
Answer the questions:
what Unicode support is this database capable of? [vague]
are particular column data types or attributes needed?
does the db have a session/connection encoding concept?
does the db support binary data types.
does the client api identify data encoding?
A table summarizing this kind of info would be of great value.
I think this is the most important kind of data we need to move
forward with this topic.  I suspect we'll end up with a few clear
levels of unicode support by databases that we can then focus on
more clearly.

+1. Yes, this should make things pretty clear.

 2. Try to make a data-driven common test script.
It should fetch the length of the stored value, something like:
CREATE TABLE t (c VARCHAR(10));
INSERT INTO t VALUES (?)  <=  $sth->execute("\x{263A}") # smiley
SELECT LENGTH(c), c FROM t
Fetching the LENGTH is important because it tells us if the DB is
treating the value as Unicode.  The description of DBD::Unify, for
example, doesn't clarify if the db itself regards the stored value
as unicode or the underlying string of encoded bytes.
Also probably best to avoid latin characters for this, I'd use
something that always has a multi-byte encoding, like a smiley face char.

And something that doesn't have a variant that uses combining characters, so 
that the length should be consistent if it's treated as Unicode.
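
For example (a quick sketch, separate from the proposed test, just to illustrate
why combining-character variants matter):

    # "é" can be one code point (U+00E9) or two (e + combining U+0301), so
    # its length depends on normalisation; U+263A has no such decomposition.
    use utf8;
    use Unicode::Normalize qw(NFC NFD);

    my $e_acute = "\x{00E9}";
    printf "e-acute NFC length: %d, NFD length: %d\n",
        length(NFC($e_acute)), length(NFD($e_acute));   # 1 vs 2

    my $smiley = "\x{263A}";
    printf "smiley  NFC length: %d, NFD length: %d\n",
        length(NFC($smiley)), length(NFD($smiley));     # 1 vs 1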

 3. Focus on placeholders initially.
We can ponder utf8 in literal SQL later. That's a separate ball of mud.
(I'd also ignore unicode table/column/db names. It's a much lower
priority and may become clearer when other issues get resolved.)

+1, though good to know about. Just as important as placeholders, however, is 
fetching data.

 4. Tests could report local LANG / LC_ALL env var value
so when others report their results we'll have that context.
 
 Thanks again. I've only given it a quick skim. I'll read it again before LPW.
 
 Meanwhile, it would be great if people could contribute the info for #1.
 
 Tim.
 
 p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
 would make the tests shorter.
my $sample_string = "\x{263A}";
...
print data_diff($sample_string, $returned_string);

Can this be turned into a complete script we can all just run?

Thanks,

David





Re: Add Unicode Support to the DBI

2011-11-08 Thread Martin J. Evans

On 08/11/2011 13:16, Tim Bunce wrote:

On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote:

I didn't think I was going to make LPW but it seems I will now - although it 
has cost me big time leaving it until the last minute.

All your beers at LPW are on me!


http://www.martin-evans.me.uk/node/121

Great work Martin. Many thanks.

I've some comments and suggestions for you...

It says "There is no single way across DBDs to enable Unicode support"
but doesn't define what "Unicode support" actually means. Clearly the
Unicode support of Oracle will be different to that of a CSV file.
So it seems that we need to be really clear about what we want.

I'd suggest...

1. Focus initially on categorising the capabilities of the databases.
 Specifically separating those that understand character encodings
 at one or more of column, table, schema, database level.
 Answer the questions:
 what Unicode support is this database capable of? [vague]
 are particular column data types or attributes needed?
 does the db have a session/connection encoding concept?
 does the db support binary data types.
 does the client api identify data encoding?
 A table summarizing this kind of info would be of great value.
 I think this is the most important kind of data we need to move
 forward with this topic.  I suspect we'll end up with a few clear
 levels of unicode support by databases that we can then focus on
 more clearly.
this will take some time but I'll work on it. It could be a nightmare 
for DBD::ODBC as it supports dozens of dbs, so I'll have to be a bit 
cagey about things like "what Unicode support is this database 
capable of".


Contributions welcome.


2. Try to make a data-driven common test script.
 It should fetch the length of the stored value, something like:
 CREATE TABLE t (c VARCHAR(10));
 INSERT INTO t VALUES (?)  <=  $sth->execute("\x{263A}") # smiley
 SELECT LENGTH(c), c FROM t
 Fetching the LENGTH is important because it tells us if the DB is
 treating the value as Unicode.  The description of DBD::Unify, for
 example, doesn't clarify if the db itself regards the stored value
 as unicode or the underlying string of encoded bytes.


Changed to add a length(c) test - sorry, I misread that the first time as the Perl
length() function rather than SQL - I'm not well enough to think clearly right now.


It will take some time to run through the DBDs to check whether it is 
length()/len() or something else - isn't SQL great -
e.g. DBD::CSV (SQL::Statement) uses char_length whereas most other SQL 
dialects use len or length.
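
A sketch of how the script might cope with that (the per-database function
names below are my best guesses for illustration, not verified against every
driver):

    # Hypothetical map from DBI driver name to the SQL function that returns
    # a length in *characters*; anything unknown falls back to LENGTH.
    my %char_length_func = (
        mysql  => 'CHAR_LENGTH',   # LENGTH() in MySQL counts bytes
        ODBC   => 'LEN',           # e.g. when talking to MS SQL Server
        CSV    => 'CHAR_LENGTH',   # SQL::Statement
        Oracle => 'LENGTH',
        Pg     => 'CHAR_LENGTH',
        SQLite => 'LENGTH',
    );

    my $driver = $dbh->{Driver}{Name};    # $dbh assumed already connected
    my $lenf   = $char_length_func{$driver} || 'LENGTH';
    my ($len)  = $dbh->selectrow_array("SELECT $lenf(c) FROM t");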


BTW, DBD::mysql fails the length test using the char_length function 
(documented as "Return number of characters in argument") - it returns 3 for the 
smiley, not 1; I've not investigated why yet.
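
A guess at what is going on (unverified): if the connection charset is latin1,
the three UTF-8 bytes of the smiley arrive as three separate characters, so the
character count is 3. Comparing byte length with character length should tell
the two cases apart:

    # bytes=3, chars=1 => stored as one 3-byte UTF-8 character (good)
    # bytes=3, chars=3 => stored as three single-byte characters (mis-encoded)
    my ($bytes, $chars) = $dbh->selectrow_array(
        "SELECT LENGTH(c), CHAR_LENGTH(c) FROM t");
    print "bytes=$bytes chars=$chars\n";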



 Also probably best to avoid latin characters for this, I'd use
 something that always has a multi-byte encoding, like a smiley face char.
changed to use smiley face for data but not for table/column name checks 
as most dbs have other rules on table/column names.



3. Focus on placeholders initially.
 We can ponder utf8 in literal SQL later. That's a separate ball of mud.
 (I'd also ignore unicode table/column/db names. It's a much lower
 priority and may become clearer when other issues get resolved.)
the test already does that so I left it in - it is easy to comment out - a one 
line change.

4. Tests could report local LANG / LC_ALL env var value
 so when others report their results we'll have that context.
added, although I stopped short of outputting everything in %ENV as I 
guessed people would end up having to edit it to remove stuff - it reports 
LANG, LC_* and NLS_* right now.
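
Something along these lines (a sketch of what the reporting might look like,
not necessarily exactly what the script does):

    # Report the locale-related environment so results can be compared.
    for my $var (sort grep { /^(?:LANG$|LC_|NLS_)/ } keys %ENV) {
        print "$var=$ENV{$var}\n";
    }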



Thanks again. I've only given it a quick skim. I'll read it again before LPW.

Meanwhile, it would be great if people could contribute the info for #1.

Tim.

p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
would make the tests shorter.
 my $sample_string = "\x{263A}";
 ...
 print data_diff($sample_string, $returned_string);

added "or diag data_diff()" to tests which fail -
is($x, $y, "test") or diag(data_diff($x, $y));

Test::More::is output is horrible for unicode on a non-UTF-8-enabled 
terminal.


BTW, the smiley character causes me problems: with MS SQL Server you can do


insert into table values(?)
bind_param(\{many_unicode_chars})
select column from table where column = $h->quote(\{many_unicode_chars})

but for some reason the smiley does not work in the select SQL even 
though the insert works :-( A parameterized select works fine too.
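
A sketch of the two select styles being compared (the table/column names and
the literal value are made up for illustration):

    # Parameterized select - this works:
    my $sth = $dbh->prepare("SELECT c FROM t WHERE c = ?");
    $sth->execute("\x{263A}");

    # Interpolated/quoted select - this is the form that fails for the
    # smiley on MS SQL Server even though the equivalent insert works:
    my $literal = $dbh->quote("\x{263A}");
    my $rows = $dbh->selectall_arrayref("SELECT c FROM t WHERE c = $literal");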


I will upload the current script to DBI's subversion tree in the ex dir, but 
if anyone is going to do anything major with it I'd rather they mail 
dbi-dev or me first so I can at least check it before making any 
changes myself (at least until after LPW). Having said that, as the 
birthday boy tomorrow I can almost guarantee I won't be