Re: Add Unicode Support to the DBI
On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote:
> I didn't think I was going to make LPW but it seems I will now -
> although it has cost me big time leaving it until the last minute.
> All your beers at LPW are on me!
> http://www.martin-evans.me.uk/node/121

Great work Martin. Many thanks. I've some comments and suggestions for you...

It says "There is no single way across DBDs to enable Unicode support" but doesn't define what "Unicode support" actually means. Clearly the Unicode support of Oracle will be different to that of a CSV file. So it seems that we need to be really clear about what we want. I'd suggest...

1. Focus initially on categorising the capabilities of the databases.
   Specifically separating those that understand character encodings at
   one or more of column, table, schema, or database level. Answer the
   questions:
   - what Unicode support is this database capable of? [vague]
   - are particular column data types or attributes needed?
   - does the db have a session/connection encoding concept?
   - does the db support binary data types?
   - does the client api identify data encoding?

   A table summarizing this kind of info would be of great value. I think
   this is the most important kind of data we need to move forward with
   this topic. I suspect we'll end up with a few clear levels of Unicode
   support by databases that we can then focus on more clearly.

2. Try to make a data-driven common test script. It should fetch the
   length of the stored value, something like:

       CREATE TABLE t (c VARCHAR(10));
       INSERT INTO t VALUES (?)   => $sth->execute("\x{263A}")  # smiley
       SELECT LENGTH(c), c FROM t

   Fetching the LENGTH is important because it tells us if the DB is
   treating the value as Unicode. The description of DBD::Unify, for
   example, doesn't clarify whether the db itself regards the stored
   value as Unicode or as the underlying string of encoded bytes.

   Also probably best to avoid Latin characters for this; I'd use
   something that always has a multi-byte encoding, like a smiley face
   char.

3. Focus on placeholders initially. We can ponder utf8 in literal SQL
   later. That's a separate ball of mud. (I'd also ignore Unicode
   table/column/db names. It's a much lower priority and may become
   clearer when other issues get resolved.)

4. Tests could report the local LANG / LC_ALL env var values so when
   others report their results we'll have that context.

Thanks again. I've only given it a quick skim. I'll read it again before LPW. Meanwhile, it would be great if people could contribute the info for #1.

Tim.

p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff would make the tests shorter:

    my $sample_string = "\x{263A}";
    ...
    print data_diff($sample_string, $returned_string);
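[An aside on why the LENGTH(c) check works: Perl's own length() makes the same character-vs-byte distinction that the database-side LENGTH() is meant to probe. A minimal sketch, core Perl only, no database needed; the variable names are ours:]

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $smiley = "\x{263A}";    # WHITE SMILING FACE

# On a decoded string, length() counts characters...
print length($smiley), "\n";                 # 1

# ...but the UTF-8 encoding of the same character is 3 bytes.
print length(encode_utf8($smiley)), "\n";    # 3
```

A database whose LENGTH(c) returns 1 is treating the column as characters; a result of 3 suggests it is counting the raw UTF-8 bytes instead.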
Re: Add Unicode Support to the DBI
On Tue, 8 Nov 2011 13:16:17 +, Tim Bunce tim.bu...@pobox.com wrote:
> On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote:
> > I didn't think I was going to make LPW but it seems I will now -
> > although it has cost me big time leaving it until the last minute.
> > All your beers at LPW are on me!
> > http://www.martin-evans.me.uk/node/121
>
> Great work Martin. Many thanks.

I more and more regret that I cannot be there. Please, besides doing useful stuff, have FUN!

-- 
H.Merijn Brand  http://tux.nl  Perl Monger  http://amsterdam.pm.org/
using 5.00307 through 5.14 and porting perl5.15.x on HP-UX 10.20, 11.00,
11.11, 11.23 and 11.31, OpenSuSE 10.1, 11.0 .. 11.4 and AIX 5.2 and 5.3.
http://mirrors.develooper.com/hpux/  http://www.test-smoke.org/
http://qa.perl.org  http://www.goldmark.org/jeff/stupid-disclaimers/
Re: Add Unicode Support to the DBI
On 08/11/11 13:16, Tim Bunce wrote:
> On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote:
> > I didn't think I was going to make LPW but it seems I will now -
> > although it has cost me big time leaving it until the last minute.
> > All your beers at LPW are on me!
> > http://www.martin-evans.me.uk/node/121
>
> Great work Martin. Many thanks. I've some comments and suggestions for
> you...
>
> It says "There is no single way across DBDs to enable Unicode support"
> but doesn't define what "Unicode support" actually means. Clearly the
> Unicode support of Oracle will be different to that of a CSV file.

Point taken.

> So it seems that we need to be really clear about what we want. I'd
> suggest...
>
> 1. Focus initially on categorising the capabilities of the databases.
>    Specifically separating those that understand character encodings at
>    one or more of column, table, schema, or database level. Answer the
>    questions: what Unicode support is this database capable of? [vague]
>    are particular column data types or attributes needed? does the db
>    have a session/connection encoding concept? does the db support
>    binary data types? does the client api identify data encoding?
>    A table summarizing this kind of info would be of great value. I
>    think this is the most important kind of data we need to move
>    forward with this topic. I suspect we'll end up with a few clear
>    levels of Unicode support by databases that we can then focus on
>    more clearly.
>
> 2. Try to make a data-driven common test script.

There is already one attached to the bottom of the post and referred to in the post - probably was not very clear.

> It should fetch the length of the stored value, something like:
>
>     CREATE TABLE t (c VARCHAR(10));
>     INSERT INTO t VALUES (?)   => $sth->execute("\x{263A}")  # smiley
>     SELECT LENGTH(c), c FROM t

It does that with a euro \x{20ac}.

> Fetching the LENGTH is important because it tells us if the DB is
> treating the value as Unicode.

It doesn't do that, but it checks that what went in (\x{20ac}) is what comes out, which is the same as checking the length, since what goes in is a euro character. For a euro to come back out, at the minimum the DBD would have to decode the data from the database. It also does this with table and column names and table_info/column_info.

> The description of DBD::Unify, for example, doesn't clarify if the db
> itself regards the stored value as unicode or the underlying string of
> encoded bytes. Also probably best to avoid Latin characters for this;
> I'd use something that always has a multi-byte encoding, like a smiley
> face char.

I mostly do that, but had problems creating tables and columns with a euro in the name (which I can sort of understand, as I can believe there are more rules about valid table/column names - like they must be alphabetic - but they could still be Unicode). The code to create Unicode table/column names uses \x{0100}, giving "fredĀ" (fred plus LATIN CAPITAL LETTER A WITH MACRON), which I think is ok.

> 3. Focus on placeholders initially. We can ponder utf8 in literal SQL
>    later. That's a separate ball of mud. (I'd also ignore unicode
>    table/column/db names. It's a much lower priority and may become
>    clearer when other issues get resolved.)

Yeah, I only added that because I knew it did not work in DBD::ODBC, because the interface is char * and not a Perl scalar.

> 4. Tests could report local LANG / LC_ALL env var value so when others
>    report their results we'll have that context.

Yes, forgot that - will add it to the test.

> Thanks again. I've only given it a quick skim. I'll read it again
> before LPW.

I will try and update it before then but a) it is my birthday tomorrow and b) I'm a bit under the weather at the moment.

> Meanwhile, it would be great if people could contribute the info for #1.

I am happy to collect any such info and write it up, so please at least cc me.

> Tim.
>
> p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
> would make the tests shorter.
>
>     my $sample_string = "\x{263A}";
>     ...
>     print data_diff($sample_string, $returned_string);

Yes, the output is a bit screwed because I was running the test code on Windows in a DOS terminal and I've never got that working properly. Data::Dumper produces better output, but I forgot data_diff - I will change it where relevant.

BTW, the example code in each DBD was not really the test, it was just an example. My aim was to produce one script which ran against any DBD, and that is attached to the end of the blog post. It's just that it was too long to incorporate into a blog posting, whereas the short simple examples were better. Unfortunately the differences between DBDs (and some limitations and bugs) are making the script quite complex in places, e.g. one which caught me out was that you have to set LongReadLen in DBD::Oracle before calling prepare, but all other DBDs let you do
Re: Add Unicode Support to the DBI
On Tue, Nov 08, 2011 at 02:45:39PM +, Martin J. Evans wrote:
> On 08/11/11 13:16, Tim Bunce wrote:
> > On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote:
> >
> > 2. Try to make a data-driven common test script.
>
> There is already one attached to the bottom of the post and referred to
> in the post - probably was not very clear.

Ah, it's in the middle, in the DBDs section. Could you put it in github or the DBI repo? I may find some time to hack on it.

> > It should fetch the length of the stored value, something like:
> >
> >     CREATE TABLE t (c VARCHAR(10));
> >     INSERT INTO t VALUES (?)   => $sth->execute("\x{263A}")  # smiley
> >     SELECT LENGTH(c), c FROM t
>
> It does that with a euro \x{20ac}.
>
> > Fetching the LENGTH is important because it tells us if the DB is
> > treating the value as Unicode.
>
> It doesn't do that, but it checks that what went in (\x{20ac}) is what
> comes out, which is the same as checking the length, since what goes in
> is a euro character. For a euro to come back out, at the minimum the
> DBD would have to decode the data from the database.

It's not only important that what goes in comes back out again, but also that what goes in is interpreted by the database in the right way. Otherwise there are lots of more subtle issues, like sorting bugs caused by the database 'seeing' encoded bytes instead of Unicode characters. (Hence my point about the description for DBD::Unify being incomplete.)

It's _possible_ for a db to store a euro as a single byte. So it's possible for the test to yield a false positive. I prefer using a smiley because it's not possible to store it in a single byte. So if LENGTH(c) returns 1 we can be very confident that the db is interpreting the data correctly.

> > Thanks again. I've only given it a quick skim. I'll read it again
> > before LPW.
>
> I will try and update it before then but a) it is my birthday tomorrow
> and b) I'm a bit under the weather at the moment.

HAPPY BIRTHDAY! Get well soon!

> > Meanwhile, it would be great if people could contribute the info for #1.
>
> I am happy to collect any such info and write it up, so please at least
> cc me.
>
> > p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
> > would make the tests shorter.
> >
> >     my $sample_string = "\x{263A}";
> >     ...
> >     print data_diff($sample_string, $returned_string);
>
> Yes, the output is a bit screwed because I was running the test code on
> Windows in a DOS terminal and I've never got that working properly.

Ouch.

> Data::Dumper produces better output, but I forgot data_diff - I will
> change it where relevant.
>
> BTW, the example code in each DBD was not really the test, it was just
> an example. My aim was to produce one script which ran against any DBD,
> and that is attached to the end of the blog post. It's just that it was
> too long to incorporate into a blog posting, whereas the short simple
> examples were better.

Yep. Good idea. I'd simply missed the link when I skimmed it.

Thanks again for driving this forward Martin.

Tim.
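[Tim's false-positive point can be demonstrated without a database: in Windows-1252, one example of a single-byte encoding with a euro, the euro sign is the single byte 0x80, while U+263A has no representation there at all. A sketch using the core Encode module:]

```perl
use strict;
use warnings;
use Encode qw(encode);

# In Windows-1252 the euro sign is the single byte 0x80, so a db using
# that code page can round-trip a euro with LENGTH() = 1 while knowing
# nothing about Unicode: a false positive for the euro-based test.
my $euro_1252 = encode("cp1252", "\x{20AC}");
printf "euro in cp1252: %d byte(s)\n", length($euro_1252);    # 1

# The smiley has no cp1252 representation at all; with FB_CROAK the
# encode dies, which is why it makes a safer test character.
my $ok = eval { encode("cp1252", "\x{263A}", Encode::FB_CROAK); 1 };
print "smiley fits in cp1252? ", ($ok ? "yes" : "no"), "\n";  # no
```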
Re: Add Unicode Support to the DBI
On Nov 8, 2011, at 5:16 AM, Tim Bunce wrote:
> 1. Focus initially on categorising the capabilities of the databases.
>    Specifically separating those that understand character encodings at
>    one or more of column, table, schema, or database level. Answer the
>    questions: what Unicode support is this database capable of? [vague]
>    are particular column data types or attributes needed? does the db
>    have a session/connection encoding concept? does the db support
>    binary data types? does the client api identify data encoding?
>    A table summarizing this kind of info would be of great value. I
>    think this is the most important kind of data we need to move
>    forward with this topic. I suspect we'll end up with a few clear
>    levels of Unicode support by databases that we can then focus on
>    more clearly.

+1. Yes, this should make things pretty clear.

> 2. Try to make a data-driven common test script. It should fetch the
>    length of the stored value, something like:
>
>     CREATE TABLE t (c VARCHAR(10));
>     INSERT INTO t VALUES (?)   => $sth->execute("\x{263A}")  # smiley
>     SELECT LENGTH(c), c FROM t
>
>    Fetching the LENGTH is important because it tells us if the DB is
>    treating the value as Unicode. The description of DBD::Unify, for
>    example, doesn't clarify if the db itself regards the stored value
>    as unicode or the underlying string of encoded bytes. Also probably
>    best to avoid Latin characters for this; I'd use something that
>    always has a multi-byte encoding, like a smiley face char.

And something that doesn't have a variant that uses combining characters, so that the length should be consistent if it's treated as Unicode.

> 3. Focus on placeholders initially. We can ponder utf8 in literal SQL
>    later. That's a separate ball of mud. (I'd also ignore unicode
>    table/column/db names. It's a much lower priority and may become
>    clearer when other issues get resolved.)

+1, though good to know about. Just as important as placeholders, however, is fetching data.

> 4. Tests could report local LANG / LC_ALL env var value so when others
>    report their results we'll have that context.
>
> Thanks again. I've only given it a quick skim. I'll read it again
> before LPW. Meanwhile, it would be great if people could contribute the
> info for #1.
>
> Tim.
>
> p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
> would make the tests shorter.
>
>     my $sample_string = "\x{263A}";
>     ...
>     print data_diff($sample_string, $returned_string);

Can this be turned into a complete script we can all just run?

Thanks,

David
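[David's combining-character caveat, shown in plain Perl: a precomposed character and its decomposed variant are the "same" text but differ in length(), whereas the smiley has no decomposition. A sketch using the core Unicode::Normalize module:]

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# "é" precomposed (NFC) is one character; decomposed (NFD) it is two
# (e + COMBINING ACUTE ACCENT), so its length depends on normalization.
print length(NFC("\x{E9}")), "\n";   # 1
print length(NFD("\x{E9}")), "\n";   # 2

# U+263A has no decomposition: NFC and NFD agree, length is always 1,
# which makes it a stable choice for a length-based test.
print length(NFD("\x{263A}")), "\n"; # 1
```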
Re: Add Unicode Support to the DBI
On 08/11/2011 13:16, Tim Bunce wrote:
> On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote:
> > I didn't think I was going to make LPW but it seems I will now -
> > although it has cost me big time leaving it until the last minute.
> > All your beers at LPW are on me!
> > http://www.martin-evans.me.uk/node/121
>
> Great work Martin. Many thanks. I've some comments and suggestions for
> you...
>
> It says "There is no single way across DBDs to enable Unicode support"
> but doesn't define what "Unicode support" actually means. Clearly the
> Unicode support of Oracle will be different to that of a CSV file. So
> it seems that we need to be really clear about what we want. I'd
> suggest...
>
> 1. Focus initially on categorising the capabilities of the databases.
>    Specifically separating those that understand character encodings at
>    one or more of column, table, schema, or database level. Answer the
>    questions: what Unicode support is this database capable of? [vague]
>    are particular column data types or attributes needed? does the db
>    have a session/connection encoding concept? does the db support
>    binary data types? does the client api identify data encoding?
>    A table summarizing this kind of info would be of great value. I
>    think this is the most important kind of data we need to move
>    forward with this topic. I suspect we'll end up with a few clear
>    levels of Unicode support by databases that we can then focus on
>    more clearly.

This will take some time but I'll work on it. It could be a nightmare for DBD::ODBC as it supports dozens of dbs, so I'll have to be a bit cagey about stuff like "what Unicode support is this database capable of". Contributions welcome.

> 2. Try to make a data-driven common test script. It should fetch the
>    length of the stored value, something like:
>
>     CREATE TABLE t (c VARCHAR(10));
>     INSERT INTO t VALUES (?)   => $sth->execute("\x{263A}")  # smiley
>     SELECT LENGTH(c), c FROM t
>
>    Fetching the LENGTH is important because it tells us if the DB is
>    treating the value as Unicode. The description of DBD::Unify, for
>    example, doesn't clarify if the db itself regards the stored value
>    as unicode or the underlying string of encoded bytes.

Changed to add a length(c) test - sorry, I misread that the first time as Perl's length(c), not SQL - I'm not well enough to think clearly right now. It will take some time to run through the DBDs to check whether it is length()/len() or something else - isn't SQL great? E.g., DBD::CSV (SQL::Statement) has char_length, and most other SQLs have len or length.

BTW, DBD::mysql fails the length test using the char_length function (documented as "Return number of characters in argument") - it returns 3 for the smiley, not 1; I've not investigated why yet.

> Also probably best to avoid Latin characters for this; I'd use
> something that always has a multi-byte encoding, like a smiley face
> char.

Changed to use the smiley face for data, but not for the table/column name checks, as most dbs have other rules on table/column names.

> 3. Focus on placeholders initially. We can ponder utf8 in literal SQL
>    later. That's a separate ball of mud. (I'd also ignore unicode
>    table/column/db names. It's a much lower priority and may become
>    clearer when other issues get resolved.)

The test already does that, so I left it in - it is easy to comment out - a one line change.

> 4. Tests could report local LANG / LC_ALL env var value so when others
>    report their results we'll have that context.

Added, although I stopped short of outputting everything in %ENV as I guessed people would end up having to edit it to remove stuff - it does LANG, LC_* and NLS_* right now.

> Thanks again. I've only given it a quick skim. I'll read it again
> before LPW. Meanwhile, it would be great if people could contribute the
> info for #1.
>
> Tim.
>
> p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
> would make the tests shorter.
>
>     my $sample_string = "\x{263A}";
>     ...
>     print data_diff($sample_string, $returned_string);

Added "or data_diff()" to tests which fail:

    is($x, $y, "test") or data_diff($x, $y);

Test::More::is output is horrible for Unicode on a non-UTF-8-enabled terminal.

BTW, the smiley character causes me problems. For some reason with MS SQL Server you can do:

    insert into table values(?)
      bind_param("\x{many_unicode_chars}")
    select column from table where column = $h->quote("\x{many_unicode_chars}")

but the smiley does not work in the select SQL even though the insert works :-( A parameterized select works fine too.

I will upload the current script to DBI's subversion tree in the ex dir, but if anyone is going to do anything major with it I'd rather they mail, say, dbi-dev or me first so I can at least check it before making any changes myself (at least until after LPW). Having said that, as the birthday boy tomorrow I can almost guarantee I won't be
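[A plausible, though unconfirmed, explanation for the char_length result of 3 reported above for DBD::mysql: U+263A is exactly three bytes in UTF-8, so a server that receives the encoded bytes as three separate (e.g. latin1) characters would count 3. The byte sequence, using the core Encode module:]

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# U+263A encodes to exactly three UTF-8 bytes; if those bytes reach the
# server as three separate characters, CHAR_LENGTH() returns 3, which
# matches the result reported for DBD::mysql (a hypothesis, not a
# confirmed diagnosis).
my @bytes = map { sprintf "0x%02X", ord } split //, encode_utf8("\x{263A}");
print scalar(@bytes), " bytes: @bytes\n";   # 3 bytes: 0xE2 0x98 0xBA
```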