Re: Add Unicode Support to the DBI

2011-11-09 Thread H.Merijn Brand
On Tue, 08 Nov 2011 21:12:13 +, Martin J. Evans
martin.ev...@easysoft.com wrote:

 I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir.
 
 It won't run right now without changing the do_connect sub as you have 
 to specify how to connect to the DB.
 Also, there is a DBD specific section at the start where you might have 
 to add a DBD it does not know about (anything other than DBD::ODBC, 
 DBD::Oracle, DBD::SQLite, DBD::CSV, DBD::mysql) if it needs something 
 other than the defaults e.g., the name of the length function in SQL, 
 the column type for unicode columns and binary columns, the setting to 
 enable UTF8/Unicode support. It could be a bit of a pain if your DBD 
 does not support type_info_all but I'm around on irc and in this list if 
 anyone wants any help making it work.
 
 It needs rather a lot of tidying up so I'm not putting it forward as 
 code-of-the-year but it is a start.
 
 BTW, you'll need Test::More::UTF8 and perhaps a couple of other non core 
 modules to run it.
 
 Martin

I'll have some deeper look at both Unify and CSV ...
Attached is a revised version of the script (first argument is the
driver to test, some need more work)

$ perl /tmp/unicode_test.pl Unify
# Driver DBD::Unify-0.78
# Using DBMS_NAME 'Unify DataServer'
# Using DBMS_VER undef
# Using DRIVER_NAME '/pro/asql/v83I/lib/perl/5.10.1/DBD/Unify.pm'
# Using DRIVER_VER '00.78.'
# SQL_IDENTIFIER_CASE 3
# LANGDIR = dutch
print() on closed filehandle $fh at /tmp/unicode_test.pl line 438.
Use of uninitialized value in concatenation (.) or string at 
/tmp/unicode_test.pl line 421.
Use of uninitialized value in concatenation (.) or string at 
/tmp/unicode_test.pl line 421.
# Found type  (HUGE AMOUNT) size=
Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422.
Use of uninitialized value in concatenation (.) or string at 
/tmp/unicode_test.pl line 421.
Use of uninitialized value in concatenation (.) or string at 
/tmp/unicode_test.pl line 421.
# Found type  (AMOUNT) size=
Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422.
Use of uninitialized value in concatenation (.) or string at 
/tmp/unicode_test.pl line 421.
Use of uninitialized value in concatenation (.) or string at 
/tmp/unicode_test.pl line 421.
# Found type  (BINARY) size=
Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422.
Use of uninitialized value in concatenation (.) or string at 
/tmp/unicode_test.pl line 421.
:
:
# Found type  (TIME) size=
Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422.
DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for 
Statement create table fredĀ ( a int)] at /tmp/unicode_test.pl line 214.
not ok 1 - unicode table name supported
#   Failed test 'unicode table name supported'
#   at /tmp/unicode_test.pl line 216.
# died: DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. 
[for Statement create table fredĀ ( a int)] at /tmp/unicode_test.pl line 214.
ok 2 # skip Failed to create unicode table name
ok 3 # skip Failed to create unicode table name
DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for 
Statement create table fred ( daveĀ int)] at /tmp/unicode_test.pl line 214.
not ok 4 - unicode column name supported


$ perl /tmp/unicode_test.pl
# Driver DBD::SQLite-1.33
# Using DBMS_NAME 'SQLite'
# Using DBMS_VER '3.7.6.3'
# Using DRIVER_NAME undef
# Using DRIVER_VER undef
# SQL_IDENTIFIER_CASE undef
# LANGDIR = dutch
print() on closed filehandle $fh at /tmp/unicode_test.pl line 438.
ok 1 - unicode table name supported
ok 2 - unicode table found in unqualified table_info
ok 3 - unicode table found by qualified table_info
ok 4 - unicode column name supported
ok 5 - unicode column found in unqualified column_info
ok 6 - unicode column found by qualified column_info
ok 7 - table for unicode data
ok 8 - insert unicode data into table
ok 9 - unicode data out = unicode data in, no where
ok 10 - length of output data the same
ok 11 - db length of unicode data correct
ok 12 - select unicode data via parameterised where
ok 13 - select unicode data via inline where
ok 14 - table for unicode data
ok 15 - insert unicode data and blob into table
ok 16 - unicode data out = unicode data in, no where with blob
ok 17 - utf8 flag not set on blob data
Use of uninitialized value in string eq at /tmp/unicode_test.pl line 373.
ok 18 - retrieved blob = inserted blob
ok 19 - test table for unicode parameter markers
DBD::SQLite::st bind_param failed: Unknown named parameter: fred€ [for 
Statement insert into fred (a) values (:fred€)] at /tmp/unicode_test.pl line 
390.
not ok 20 - bind parameter with unicode parameter marker
#   Failed test 'bind parameter with unicode parameter marker'
#   at /tmp/unicode_test.pl line 392.
# died: DBD::SQLite::st bind_param failed: Unknown named parameter: fred€ 
[for Statement insert into fred (a) values (:fred€)] at /tmp/unicode_test.pl 
line 390.

Re: Add Unicode Support to the DBI

2011-11-09 Thread H.Merijn Brand
On Tue, 08 Nov 2011 21:12:13 +, Martin J. Evans
martin.ev...@easysoft.com wrote:

 I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir.

So now attached

-- 
H.Merijn Brand  http://tux.nl  Perl Monger  http://amsterdam.pm.org/
using 5.00307 through 5.14 and porting perl5.15.x on HP-UX 10.20, 11.00,
11.11, 11.23 and 11.31, OpenSuSE 10.1, 11.0 .. 11.4 and AIX 5.2 and 5.3.
http://mirrors.develooper.com/hpux/   http://www.test-smoke.org/
http://qa.perl.org  http://www.goldmark.org/jeff/stupid-disclaimers/


unicode_test.pl
Description: Perl program


Re: Add Unicode Support to the DBI

2011-11-09 Thread Tim Bunce
On Wed, Nov 09, 2011 at 04:50:29PM +0100, H.Merijn Brand wrote:
 On Tue, 08 Nov 2011 21:12:13 +, Martin J. Evans
 martin.ev...@easysoft.com wrote:
 
  I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir.
 
 So now attached

Any chance you could rework your changes into the (recently updated)
version in the DBI svn repo?

Tim.


Re: Add Unicode Support to the DBI

2011-11-09 Thread H.Merijn Brand
On Wed, 9 Nov 2011 16:23:53 +, Tim Bunce tim.bu...@pobox.com
wrote:

 On Wed, Nov 09, 2011 at 04:50:29PM +0100, H.Merijn Brand wrote:
  On Tue, 08 Nov 2011 21:12:13 +, Martin J. Evans
  martin.ev...@easysoft.com wrote:
  
   I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir.
  
  So now attached
 
 Any chance you could rework your changes into the (recently updated)
 version in the DBI svn repo?

Yes, after I verify why DBD::Unify messes up on this

(FWIW, the number of Unify customers is getting smaller and smaller, so
the motivation to improve on DBD::Unify is diminishing)

 Tim.



Re: Add Unicode Support to the DBI

2011-11-09 Thread Martin J. Evans

On 09/11/2011 15:49, H.Merijn Brand wrote:

On Tue, 08 Nov 2011 21:12:13 +, Martin J. Evans
martin.ev...@easysoft.com  wrote:


I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir.

It won't run right now without changing the do_connect sub as you have
to specify how to connect to the DB.
Also, there is a DBD specific section at the start where you might have
to add a DBD it does not know about (anything other than DBD::ODBC,
DBD::Oracle, DBD::SQLite, DBD::CSV, DBD::mysql) if it needs something
other than the defaults e.g., the name of the length function in SQL,
the column type for unicode columns and binary columns, the setting to
enable UTF8/Unicode support. It could be a bit of a pain if your DBD
does not support type_info_all but I'm around on irc and in this list if
anyone wants any help making it work.

It needs rather a lot of tidying up so I'm not putting it forward as
code-of-the-year but it is a start.

BTW, you'll need Test::More::UTF8 and perhaps a couple of other non core
modules to run it.

Martin

I'll have some deeper look at both Unify and CSV ...
Attached is a revised version of the script (first argument is the
driver to test, some need more work)

$ perl /tmp/unicode_test.pl Unify
# Driver DBD::Unify-0.78
# Using DBMS_NAME 'Unify DataServer'
# Using DBMS_VER undef
# Using DRIVER_NAME '/pro/asql/v83I/lib/perl/5.10.1/DBD/Unify.pm'
# Using DRIVER_VER '00.78.'
# SQL_IDENTIFIER_CASE 3
# LANGDIR = dutch
print() on closed filehandle $fh at /tmp/unicode_test.pl line 438.
Use of uninitialized value in concatenation (.) or string at 
/tmp/unicode_test.pl line 421.
Use of uninitialized value in concatenation (.) or string at 
/tmp/unicode_test.pl line 421.
# Found type  (HUGE AMOUNT) size=
Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422.
Use of uninitialized value in concatenation (.) or string at 
/tmp/unicode_test.pl line 421.
Use of uninitialized value in concatenation (.) or string at 
/tmp/unicode_test.pl line 421.
# Found type  (AMOUNT) size=
Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422.
Use of uninitialized value in concatenation (.) or string at 
/tmp/unicode_test.pl line 421.
Use of uninitialized value in concatenation (.) or string at 
/tmp/unicode_test.pl line 421.
# Found type  (BINARY) size=
Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422.
Use of uninitialized value in concatenation (.) or string at 
/tmp/unicode_test.pl line 421.
:
:
# Found type  (TIME) size=
Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422.
DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for Statement 
create table fredĀ ( a int)] at /tmp/unicode_test.pl line 214.
not ok 1 - unicode table name supported
#   Failed test 'unicode table name supported'
#   at /tmp/unicode_test.pl line 216.
# died: DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for 
Statement create table fredĀ ( a int)] at /tmp/unicode_test.pl line 214.
ok 2 # skip Failed to create unicode table name
ok 3 # skip Failed to create unicode table name
DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for Statement 
create table fred ( daveĀ int)] at /tmp/unicode_test.pl line 214.
not ok 4 - unicode column name supported
You're going to have a lot of problems with this test code and DBD::Unify 
as we previously discovered that DBD::Unify does not itself decode the data 
coming back from the database, although a Perl script using DBD::Unify can 
decode it into the correct characters. Any chance you could change the test 
code to print out the results of type_info_all for DBD::Unify and send them 
to me?


I'd like to fix all those "Use of uninitialized value" warnings so I'll look 
into that. The test code has changed so the line numbers no longer match and 
I'll have to compare them with your attachment. I'm guessing they are all in 
the processing of the results from type_info_all. To be honest, this is not a 
unicode/encoding issue but one of my biggest annoyances when working with 
DBDs: finding the right column type for a table is really hit and miss - many 
DBDs don't support type_info_all and some that do only provide a subset of 
the columns DBI defines.


Somewhere (I've lost it right now but I'll find it later) I have a 
document which details my experiences of trying to write DBD-independent 
code which worked with DBD::mysql, DBD::DB2 and DBD::Oracle. In the end 
we gave up - it was just too difficult. It wasn't just the different SQL 
syntax/support. We ended up putting all the logic into 
procedures/functions and simply calling them from Perl, which avoided the 
SQL syntax issues and loads of other differences. Perhaps I'll go back to 
that if I find some time.




$ perl /tmp/unicode_test.pl
# Driver DBD::SQLite-1.33
# Using DBMS_NAME 'SQLite'
# Using DBMS_VER '3.7.6.3'
# Using DRIVER_NAME undef
# Using DRIVER_VER undef
# SQL_IDENTIFIER_CASE undef
# 

Re: Add Unicode Support to the DBI

2011-11-09 Thread H.Merijn Brand
On Wed, 09 Nov 2011 19:41:33 +, Martin J. Evans
martin.ev...@easysoft.com wrote:

 You're going to have a lot of problems with this test code and DBD::Unify 
 as we previously discovered that DBD::Unify does not decode the data 
 coming back from the database itself but it can be decoded by any Perl 
 script using DBD::Unify into the correct data. Any chance you could 
 change the test code to print out the results of type_info_all for 
 DBD::Unify and send me them?

Right, I have this:
--8---
$type_info_all = [
    {   TYPE_NAME          =>  0,
        DATA_TYPE          =>  1,
        COLUMN_SIZE        =>  2,
        LITERAL_PREFIX     =>  3,
        LITERAL_SUFFIX     =>  4,
        CREATE_PARAMS      =>  5,
        NULLABLE           =>  6,
        CASE_SENSITIVE     =>  7,
        SEARCHABLE         =>  8,
        UNSIGNED_ATTRIBUTE =>  9,
        FIXED_PREC_SCALE   => 10,
        AUTO_UNIQUE_VALUE  => 11,
        LOCAL_TYPE_NAME    => 12,
        MINIMUM_SCALE      => 13,
        MAXIMUM_SCALE      => 14,
        SQL_DATA_TYPE      => 15,
        SQL_DATETIME_SUB   => 16,
        NUM_PREC_RADIX     => 17,
        INTERVAL_PRECISION => 18,
        },

    # TYPE_NAME            DATA_TYPE        SIZE  PFX   SFX   PARAMS             N C S UNSIG FPS   AUTO  LOCAL MINSC MAXSC SDT   SDS   RADIX PREC
    [ "UNKNOWN",           0,               undef,undef,undef,undef,             1,0,3,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef, ],
#   [ "GIANT AMOUNT",      undef,undef,undef,"PRECISION,SCALE",                  1,0,3,0,undef,undef,undef,2,2,undef,undef,undef,undef, ],
    [ "HUGE AMOUNT",       -207,            undef,undef,undef,"PRECISION,SCALE", 1,0,3,0,undef,undef,undef,2,2,undef,undef,undef,undef, ],
    [ "AMOUNT",            -206,            undef,undef,undef,"PRECISION,SCALE", 1,0,3,0,undef,undef,undef,2,2,undef,undef,undef,undef, ],
    [ "VARBINARY",         SQL_VARBINARY,   undef,"'",  "'",  undef,             1,0,3,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef, ],
    [ "BINARY",            SQL_BINARY,      undef,undef,undef,undef,             1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
    [ "CHAR",              SQL_CHAR,        undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
    [ "CURRENCY",          -218,            undef,undef,undef,"PRECISION,SCALE", 1,0,3,0,2,undef,undef,0,8,undef,undef,undef,undef, ],
    [ "TIMESTAMP",         SQL_TIMESTAMP,   undef,undef,undef,undef,             1,0,3,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef, ],
    [ "DATE",              SQL_DATE,        undef,undef,undef,undef,             1,0,3,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef, ],
    [ "DECIMAL",           SQL_DECIMAL,     undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
    [ "DOUBLE PRECISION",  SQL_DOUBLE,      undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
    [ "FLOAT",             SQL_FLOAT,       undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
    [ "HUGE INTEGER",      SQL_BIGINT,      undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,0,undef,undef,undef,undef, ],
    [ "INTEGER",           SQL_INTEGER,     undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
    [ "NUMERIC",           SQL_NUMERIC,     undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,0,undef,undef,undef,undef, ],
    [ "REAL",              SQL_REAL,        undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
    [ "SMALLINT",          SQL_SMALLINT,    undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,0,undef,undef,undef,undef, ],
    [ "TEXT",              SQL_LONGVARCHAR, undef,"'",  "'",  undef,             1,0,3,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef, ],
    [ "TIME",              SQL_TIME,        undef,undef,undef,undef,             1,0,3,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef, ],
    ];

my %odbc_types = map { ( $_->[0] => $_->[1], $_->[1] => $_->[0] ) }
    [  -5  => "BIGINT"           ], # SQL_BIGINT
    [  -3  => "VARBINARY"        ], # SQL_VARBINARY
    [  -2  => "BINARY"           ], # SQL_BINARY
    [  -1  => "TEXT"             ], # SQL_LONGVARCHAR
    [   0  => "UNKNOWN_TYPE"     ], # SQL_UNKNOWN_TYPE
    [   1  => "CHAR"             ], # SQL_CHAR
    [   2  => "NUMERIC"          ], # SQL_NUMERIC
    [   3  => "DECIMAL"          ], # SQL_DECIMAL
    [   4  => "INTEGER"          ], # SQL_INTEGER
    [   5  => "SMALLINT"         ], # SQL_SMALLINT
    [   6  => "FLOAT"            ], # SQL_FLOAT
    [   7  => "REAL"             ], # SQL_REAL
    [   8  => "DOUBLE PRECISION" ], # SQL_DOUBLE
    [   9  => "DATE"             ], # SQL_DATE
    [  10  => "TIME"             ], # SQL_TIME
    [  11

Re: Add Unicode Support to the DBI

2011-11-08 Thread Tim Bunce
On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote:
 
 I didn't think I was going to make LPW but it seems I will now - although it 
 has cost me big time leaving it until the last minute.

All your beers at LPW are on me!

 http://www.martin-evans.me.uk/node/121

Great work Martin. Many thanks.

I've some comments and suggestions for you...

It says "There is no single way across DBDs to enable Unicode support"
but doesn't define what "Unicode support" actually means. Clearly the
Unicode support of Oracle will be different to that of a CSV file.
So it seems that we need to be really clear about what we want.

I'd suggest...

1. Focus initially on categorising the capabilities of the databases.
Specifically separating those that understand character encodings
at one or more of column, table, schema, database level.
Answer the questions:
what Unicode support is this database capable of? [vague]
are particular column data types or attributes needed?
does the db have a session/connection encoding concept?
does the db support binary data types?
does the client api identify data encoding?
A table summarizing this kind of info would be of great value.
I think this is the most important kind of data we need to move
forward with this topic.  I suspect we'll end up with a few clear
levels of unicode support by databases that we can then focus on
more clearly.

2. Try to make a data-driven common test script.
It should fetch the length of the stored value, something like:
CREATE TABLE t (c VARCHAR(10));
INSERT INTO t VALUES (?)  =>  $sth->execute("\x{263A}")  # smiley
SELECT LENGTH(c), c FROM t
Fetching the LENGTH is important because it tells us if the DB is
treating the value as Unicode.  The description of DBD::Unify, for
example, doesn't clarify if the db itself regards the stored value
as unicode or as the underlying string of encoded bytes.
Also it's probably best to avoid latin characters for this; I'd use
something that always has a multi-byte encoding, like a smiley face char.

3. Focus on placeholders initially.
We can ponder utf8 in literal SQL later. That's a separate ball of mud.
(I'd also ignore unicode table/column/db names. It's a much lower
priority and may become clearer when other issues get resolved.)

4. Tests could report local LANG / LC_ALL env var value
so when others report their results we'll have that context.
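Suggestion 2 might be sketched as something like the following. This is only an illustration, assuming DBD::SQLite and its sqlite_unicode flag; other drivers need their own connection attributes and their own spelling of the SQL length function:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Minimal sketch of the data-driven length test (DBD::SQLite assumed).
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
    { RaiseError => 1, sqlite_unicode => 1 });
$dbh->do("CREATE TABLE t (c VARCHAR(10))");

my $smiley = "\x{263A}";    # WHITE SMILING FACE, always multi-byte in UTF-8
$dbh->do("INSERT INTO t VALUES (?)", undef, $smiley);

my ($db_len, $out) = $dbh->selectrow_array("SELECT LENGTH(c), c FROM t");
# If the db treats the value as characters, both lengths should be 1.
print "db length=$db_len, perl length=", length($out), "\n";
```

If LENGTH(c) comes back as 3, the database is counting the stored UTF-8 bytes rather than characters.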

Thanks again. I've only given it a quick skim. I'll read it again before LPW.

Meanwhile, it would be great if people could contribute the info for #1.

Tim.

p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
would make the tests shorter.
my $sample_string = "\x{263A}";
...
print data_diff($sample_string, $returned_string);
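As a rough illustration of that p.s. (the "returned" value below is invented to simulate a driver handing back undecoded UTF-8 bytes):

```perl
use strict;
use warnings;
use DBI qw(data_diff);

my $sample_string   = "\x{263A}";
# Hypothetical round-trip result: the UTF-8 bytes came back undecoded.
my $returned_string = "\xE2\x98\xBA";

# data_diff() returns "" when the strings are identical, otherwise a
# short description (logical/physical lengths, hex dumps) of both.
print data_diff($sample_string, $returned_string);
```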


Re: Add Unicode Support to the DBI

2011-11-08 Thread H.Merijn Brand
On Tue, 8 Nov 2011 13:16:17 +, Tim Bunce tim.bu...@pobox.com
wrote:

 On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote:
  
   I didn't think I was going to make LPW but it seems I will now -
   although it has cost me big time leaving it until the last minute.  
 
 All your beers at LPW are on me!
 
  http://www.martin-evans.me.uk/node/121  
 
 Great work Martin. Many thanks.

I regret more and more that I cannot be there.
Please, beside doing useful stuff, have FUN!



Re: Add Unicode Support to the DBI

2011-11-08 Thread Martin J. Evans

On 08/11/11 13:16, Tim Bunce wrote:

On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote:


I didn't think I was going to make LPW but it seems I will now - although it 
has cost me big time leaving it until the last minute.


All your beers at LPW are on me!


http://www.martin-evans.me.uk/node/121


Great work Martin. Many thanks.

I've some comments and suggestions for you...

It says "There is no single way across DBDs to enable Unicode support"
but doesn't define what "Unicode support" actually means. Clearly the
Unicode support of Oracle will be different to that of a CSV file.


Point taken.


So it seems that we need to be really clear about what we want.

I'd suggest...

1. Focus initially on categorising the capabilities of the databases.
 Specifically separating those that understand character encodings
 at one or more of column, table, schema, database level.
 Answer the questions:
 what Unicode support is this database capable of? [vague]
 are particular column data types or attributes needed?
 does the db have a session/connection encoding concept?
 does the db support binary data types?
 does the client api identify data encoding?
 A table summarizing this kind of info would be of great value.
 I think this is the most important kind of data we need to move
 forward with this topic.  I suspect we'll end up with a few clear
 levels of unicode support by databases that we can then focus on
 more clearly.




2. Try to make a data-driven common test script.


There is already one attached to the bottom of the post and referred to in the 
post - probably was not very clear.


 It should fetch the length of the stored value, something like:
 CREATE TABLE t (c VARCHAR(10));
 INSERT INTO t VALUES (?)  =>  $sth->execute("\x{263A}")  # smiley
 SELECT LENGTH(c), c FROM t


It does that with a euro (\x{20ac}).


 Fetching the LENGTH is important because it tells us if the DB is
 treating the value as Unicode.


It doesn't do that but it checks that what went in (\x{20ac}) is what comes out, 
which is the same as checking the length, since what goes in is a euro 
character. For a euro to come back out, at a minimum the DBD would have to 
decode the data from the database.

It also does this with table and column names and table_info/column_info.


 The description of DBD::Unify, for
 example, doesn't clarify if the db itself regards the stored value
 as unicode or the underlying string of encoded bytes.
 Also probably best to avoid latin characters for this, I'd use
 something that always has a multi-byte encoding, like a smiley face char.


I mostly do that but had problems creating tables and columns with a euro in 
the name (which I can sort of understand, as I can believe there are more rules 
for valid table/column names - e.g. they must be alphabetic - but they could 
still be unicode). The code to create unicode table/column names uses \x{0100}, 
which is

fredĀ (LATIN CAPITAL LETTER A WITH MACRON)

which I think is ok.


3. Focus on placeholders initially.
 We can ponder utf8 in literal SQL later. That's a separate ball of mud.
 (I'd also ignore unicode table/column/db names. It's a much lower
 priority and may become clearer when other issues get resolved.)


Yeah, I only added that because I knew it did not work in DBD::ODBC, because 
the interface is char * and not a Perl scalar.
 

4. Tests could report local LANG / LC_ALL env var value
 so when others report their results we'll have that context.


Yes, forgot that - will add it to the test.
 

Thanks again. I've only given it a quick skim. I'll read it again before LPW.


I will try and update it before then but a) it is my birthday tomorrow and b) 
I'm a bit under the weather at the moment.

 

Meanwhile, it would be great if people could contribute the info for #1.


I am happy to collect any such info and write it up so please at least cc me.
 

Tim.

p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
would make the tests shorter.
 my $sample_string = "\x{263A}";
 ...
 print data_diff($sample_string, $returned_string);


Yes, the output is a bit screwed because I was running the test code on Windows 
in a dos terminal and I've never got that working properly. Data::Dumper 
produces better output but I forgot data_diff - I will change it where relevant.

BTW, the example code in each DBD was not really the test, it was just an 
example. My aim was to produce one script which ran against any DBD and that is 
attached to the end of the blog post. It's just that it was too long to 
incorporate into a blog posting, whereas the short simple examples were better. 
Unfortunately the differences between DBDs (and some limitations and bugs) are 
making the script quite complex in places, e.g., one which caught me out was 
that you have to set LongReadLen in DBD::Oracle before calling prepare but all 
other DBDs let you do

Re: Add Unicode Support to the DBI

2011-11-08 Thread Tim Bunce
On Tue, Nov 08, 2011 at 02:45:39PM +, Martin J. Evans wrote:
 On 08/11/11 13:16, Tim Bunce wrote:
 On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote:
 
 2. Try to make a data-driven common test script.
 
 There is already one attached to the bottom of the post and referred to in 
 the post - probably was not very clear.

Ah, it's in the middle, in the DBDs section.
Could you put it in github or the DBI repo?
I may find some time to hack on it.

  It should fetch the length of the stored value, something like:
  CREATE TABLE t (c VARCHAR(10));
  INSERT INTO t VALUES (?)  =>  $sth->execute("\x{263A}")  # smiley
  SELECT LENGTH(c), c FROM t
 
 It does that with a euro \x{20ac}
 
  Fetching the LENGTH is important because it tells us if the DB is
  treating the value as Unicode.
 
 It doesn't do that but it checks that what went in (\x{20ac}) is what comes
 out, which is the same as checking the length, since what goes in is a
 euro character. For a euro to come back out, at a minimum the DBD
 would have to decode the data from the database.

It's not only important that what goes in comes back out again, but also
that what goes in is interpreted by the database in the right way.
Otherwise there are lots of more subtle issues like sorting bugs caused
by the database 'seeing' encoded bytes instead of unicode characters.
(Hence my point about the description for DBD::Unify being incomplete.)

It's _possible_ for a db to store a euro as a single byte, so it's
possible for the test to yield a false positive.  I prefer using a
smiley because it's not possible to store it in a single byte.
So if LENGTH(c) returns 1 we can be very confident that the db is
interpreting the data correctly.
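The false-positive risk can be shown with core Encode alone: the euro fits in one byte under cp1252, while the smiley has no single-byte encoding (a sketch, not DBI-specific):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $euro   = "\x{20AC}";   # EURO SIGN
my $smiley = "\x{263A}";   # WHITE SMILING FACE

# A byte-counting db holding cp1252 data could report length 1 for a
# euro and look correct by accident:
print "euro as cp1252 bytes:  ", length(encode("cp1252", $euro)),  "\n"; # 1
print "euro as UTF-8 bytes:   ", length(encode("UTF-8",  $euro)),  "\n"; # 3
# The smiley can never be a single byte, so LENGTH() == 1 really does
# mean the db counted characters:
print "smiley as UTF-8 bytes: ", length(encode("UTF-8",  $smiley)), "\n"; # 3
```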

 Thanks again. I've only given it a quick skim. I'll read it again before LPW.
 
 I will try and update it before then but a) it is my birthday tomorrow and b) 
 I'm a bit under the weather at the moment.

HAPPY BIRTHDAY!
Get well soon!

 Meanwhile, it would be great if people could contribute the info for #1.
 
 I am happy to collect any such info and write it up so please at least cc me.

 p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
 would make the tests shorter.
  my $sample_string = "\x{263A}";
  ...
  print data_diff($sample_string, $returned_string);
 
 Yes, the output is a bit screwed because I was running the test code
 on Windows in a dos terminal and I've never got that working properly.

Ouch.

 Data::Dumper produces better output but I forgot data_diff - I will
 change where relevant.
 
 BTW, the example code in each DBD was not really the test, it was just
 an example. My aim was to produce one script which ran against any DBD and
 that is attached to the end of the blog post. It's just it was too
 long to incorporate into a blog posting whereas the short simple
 examples were better.

Yeap. Good idea. I'd simply missed the link when I skimmed it.

Thanks again for driving this forward Martin.

Tim.


Re: Add Unicode Support to the DBI

2011-11-08 Thread David E. Wheeler
On Nov 8, 2011, at 5:16 AM, Tim Bunce wrote:

 1. Focus initially on categorising the capabilities of the databases.
Specifically separating those that understand character encodings
at one or more of column, table, schema, database level.
Answer the questions:
what Unicode support is this database capable of? [vague]
are particular column data types or attributes needed?
does the db have a session/connection encoding concept?
does the db support binary data types?
does the client api identify data encoding?
A table summarizing this kind of info would be of great value.
I think this is the most important kind of data we need to move
forward with this topic.  I suspect we'll end up with a few clear
levels of unicode support by databases that we can then focus on
more clearly.

+1. Yes, this should make things pretty clear.

 2. Try to make a data-driven common test script.
It should fetch the length of the stored value, something like:
CREATE TABLE t (c VARCHAR(10));
INSERT INTO t VALUES (?)  =>  $sth->execute("\x{263A}")  # smiley
SELECT LENGTH(c), c FROM t
Fetching the LENGTH is important because it tells us if the DB is
treating the value as Unicode.  The description of DBD::Unify, for
example, doesn't clarify if the db itself regards the stored value
as unicode or the underlying string of encoded bytes.
Also probably best to avoid latin characters for this, I'd use
something that always has a multi-byte encoding, like a smiley face char.

And something that doesn't have a variant that uses combining characters, so 
that the length should be consistent if it's treated as Unicode.
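The combining-character point can be seen with core Unicode::Normalize: U+00E9 and e + U+0301 render identically but have different lengths unless normalized first (a sketch):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{00E9}";    # LATIN SMALL LETTER E WITH ACUTE
my $decomposed  = "e\x{0301}";   # e + COMBINING ACUTE ACCENT

# Same rendered glyph, different code-point counts:
print length($precomposed), " vs ", length($decomposed), "\n";           # 1 vs 2
# Normalizing first makes the lengths agree:
print length(NFC($precomposed)), " vs ", length(NFC($decomposed)), "\n"; # 1 vs 1
print length(NFD($precomposed)), " vs ", length(NFD($decomposed)), "\n"; # 2 vs 2
```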

 3. Focus on placeholders initially.
We can ponder utf8 in literal SQL later. That's a separate ball of mud.
(I'd also ignore unicode table/column/db names. It's a much lower
priority and may become clearer when other issues get resolved.)

+1, though good to know about. Just as important as placeholders, however, is 
fetching data.

 4. Tests could report local LANG / LC_ALL env var value
so when others report their results we'll have that context.
 
 Thanks again. I've only given it a quick skim. I'll read it again before LPW.
 
 Meanwhile, it would be great if people could contribute the info for #1.
 
 Tim.
 
 p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
 would make the tests shorter.
my $sample_string = "\x{263A}";
...
print data_diff($sample_string, $returned_string);

Can this be turned into a complete script we can all just run?

Thanks,

David





Re: Add Unicode Support to the DBI

2011-11-08 Thread Martin J. Evans

On 08/11/2011 13:16, Tim Bunce wrote:

On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote:

I didn't think I was going to make LPW but it seems I will now - although it 
has cost me big time leaving it until the last minute.

All your beers at LPW are on me!


http://www.martin-evans.me.uk/node/121

Great work Martin. Many thanks.

I've some comments and suggestions for you...

It says "There is no single way across DBDs to enable Unicode support"
but doesn't define what "Unicode support" actually means. Clearly the
Unicode support of Oracle will be different to that of a CSV file.
So it seems that we need to be really clear about what we want.

I'd suggest...

1. Focus initially on categorising the capabilities of the databases.
 Specifically separating those that understand character encodings
 at one or more of column, table, schema, database level.
 Answer the questions:
 what Unicode support is this database capable of? [vague]
 are particular column data types or attributes needed?
 does the db have a session/connection encoding concept?
 does the db support binary data types?
 does the client api identify data encoding?
 A table summarizing this kind of info would be of great value.
 I think this is the most important kind of data we need to move
 forward with this topic.  I suspect we'll end up with a few clear
 levels of unicode support by databases that we can then focus on
 more clearly.
this will take some time but I'll work on it. It could be a nightmare 
for DBD::ODBC as it supports dozens of dbs so I'll have to be a bit 
cagey about the stuff like "what Unicode support is this database 
capable of".


Contributions welcome.


2. Try to make a data-driven common test script.
 It should fetch the length of the stored value, something like:
 CREATE TABLE t (c VARCHAR(10));
 INSERT INTO t VALUES (?)  =>  $sth->execute("\x{263A}") # smiley
 SELECT LENGTH(c), c FROM t
 Fetching the LENGTH is important because it tells us if the DB is
 treating the value as Unicode.  The description of DBD::Unify, for
 example, doesn't clarify if the db itself regards the stored value
 as unicode or the underlying string of encoded bytes.
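
Tim's suggested probe can be sketched end-to-end; DBD::SQLite is assumed here purely for illustration (any DBD will do if you swap the DSN and its Unicode flag), and sqlite_unicode is DBD::SQLite's switch for decoding fetched text as UTF-8:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use DBI;

# Assumes DBD::SQLite; other drivers need their own DSN and Unicode flag.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, sqlite_unicode => 1 });

$dbh->do('CREATE TABLE t (c VARCHAR(10))');

my $smiley = "\x{263A}";    # WHITE SMILING FACE - always multi-byte in UTF-8
my $sth = $dbh->prepare('INSERT INTO t VALUES (?)');
$sth->execute($smiley);

# If the db sees one character, LENGTH(c) is 1; if it sees the raw
# UTF-8 octets as individual "characters", it reports 3.
my ($len, $c) = $dbh->selectrow_array('SELECT LENGTH(c), c FROM t');
printf "db length=%d perl length=%d round-trip=%s\n",
    $len, length($c), ($c eq $smiley ? 'ok' : 'BROKEN');
```

With a driver that treats the value as characters both lengths should agree and the round trip should succeed.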


Changed to add length(c) test - sorry misread that first time as Perl 
code length(c) not SQL - I'm not well enough to think so clearly right now.


Will take some time to run through the DBDs to check whether it is 
length()/len() or something else - isn't SQL great?
e.g., DBD::CSV (SQL::Statement) uses char_length and most other SQLs use 
len or length.


BTW, DBD::mysql fails the length test using the length function 
char_length (documented as "Return number of characters in argument") - it 
returns 3 for the smiley, not 1; not investigated why yet.
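
For what it's worth, 3 is exactly the smiley's UTF-8 byte count, so one plausible (unverified) explanation is that the raw bytes rather than the character reached the server:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $smiley = "\x{263A}";
printf "characters: %d, UTF-8 octets: %d\n",
    length($smiley), length(encode('UTF-8', $smiley));
# prints "characters: 1, UTF-8 octets: 3"
```

If so, DBD::mysql's mysql_enable_utf8 connect attribute and the connection character set would be the first things to check.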



 Also probably best to avoid latin characters for this, I'd use
 something that always has a multi-byte encoding, like a smiley face char.
changed to use smiley face for data but not for table/column name checks 
as most dbs have other rules on table/column names.



3. Focus on placeholders initially.
 We can ponder utf8 in literal SQL later. That's a separate ball of mud.
 (I'd also ignore unicode table/column/db names. It's a much lower
 priority and may become clearer when other issues get resolved.)
test already does that so left in - it is easy to comment it out - one 
line change.

4. Tests could report local LANG / LC_ALL env var value
 so when others report their results we'll have that context.
added although I stopped short of outputting everything in %ENV as I 
guessed people would end up having to edit it to remove stuff - it does 
LANG, LC_* and NLS_* right now.
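
The filtering described can be as simple as this (a sketch; the actual script may differ):

```perl
use strict;
use warnings;

# Report only the locale-related environment: LANG, LC_*, NLS_*.
for my $var (sort grep { /^(?:LANG$|LC_|NLS_)/ } keys %ENV) {
    print "$var=$ENV{$var}\n";
}
```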



Thanks again. I've only given it a quick skim. I'll read it again before LPW.

Meanwhile, it would be great if people could contribute the info for #1.

Tim.

p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
would make the tests shorter.
 my $sample_string = "\x{263A}";
 ...
 print data_diff($sample_string, $returned_string);

added "or data_diff()" to tests which fail -
is($x, $y, "test") or data_diff($x, $y)

Test::More::is output is horrible for unicode on a non-UTF-8-enabled 
terminal.
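
The failure pattern Martin describes, made concrete: DBI's data_diff() reports logical (character) and physical (byte) differences as text, which survives a terminal that cannot render the characters themselves.

```perl
use strict;
use warnings;
use utf8;
use Test::More tests => 1;
use DBI;    # provides DBI::data_diff()

my $expected = "\x{263A}";
my $got      = $expected;    # stand-in for a value fetched back from the db

is($got, $expected, 'unicode round trip')
    or diag(DBI::data_diff($got, $expected));
```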


BTW, the smiley character causes me problems as for some reason with MS 
SQL Server you can do:


insert into table values(?)
bind_param("\x{many_unicode_chars}")
select column from table where column = $h->quote("\x{many_unicode_chars}")

but for some reason the smiley does not work in the select SQL even 
though the insert works :-( A parameterized select works fine too.


I will upload the current script to DBI's subversion tree in ex dir but 
if anyone is going to do anything major with it I'd rather they mail say 
dbi-dev or me first and I can at least check that before making any 
changes myself (at least until after LPW). Having said that as the 
birthday boy tomorrow I can almost guarantee I won't be 

Re: Add Unicode Support to the DBI

2011-11-07 Thread Martin J. Evans

On 04/11/11 08:39, Martin J. Evans wrote:

On 03/11/11 23:25, David E. Wheeler wrote:

On Oct 7, 2011, at 5:06 PM, David E. Wheeler wrote:


Perhaps we could carve out some time at LPW to sit together and try to
progress this.


That would be awesome you guys!


So gents, do you plan to do this a bit? Martin, do you have the data you wanted 
to collect on this?

Thanks,

David



Sorry David, I've been snowed under. I will try very hard to publish the 
research I found this weekend.

I didn't think I was going to make LPW but it seems I will now - although it 
has cost me big time leaving it until the last minute.

Martin


http://www.martin-evans.me.uk/node/121

Martin
--
Martin J. Evans
Easysoft Limited
http://www.easysoft.com


Re: Add Unicode Support to the DBI

2011-11-06 Thread Martin J. Evans

On 05/10/2011 00:06, Jonathan Leffler wrote:

On Tue, Oct 4, 2011 at 15:24, Martin J. Evans martin.ev...@easysoft.com wrote:


On 04/10/2011 22:38, Tim Bunce wrote:


I've not had time to devote to this thread. Sorry.

I'd be grateful if someone could post a summary of it if/when it
approaches some kind of consensus.

  I don't think there is a kind of consensus right now (although some

useful discussion which probably will bear fruit) and I'd prefer to work out
what unicode support already exists and how it is implemented first. For
instance, Pg is very focussed on UTF-8 (as are most DBDs) and yet ODBC uses
UCS2 under the hood and CSV can use anything you like. Greg/David/Postgres
seem to have an immediate problem with unicode support in Postgres and I can
imagine they are keen to resolve it and I'd suggest they do it now in the
most appropriate way for DBD::Pg. I don't see why this should necessarily
impact on any discussion as to what DBI should_do/should_say as already the
DBDs which support unicode mostly do it in different ways.

I've started gathering together details of what unicode support there is in
DBDs, how it is implemented and what special flags there are to support it.
However, this is a massive task. So far I've done ODBC, Oracle, CSV, Unify,
mysql, SQLite, Firebird and sort of held off on Pg as I knew Greg was
working on it. Some might disagree but DB2 is a main one I no longer have
access to (please contact me if you use DBD::DB2 and are prepared to spare
half an hour or so to modify examples I have which verify unicode support).
Of course, if you use another DBD and can send me info on unicode support
I'd love to hear from you.

I thought the whole issue was an interesting topic and I had toyed with
doing a talk for LPW but to be honest, it is already taking a lot of time
and I have personal issues right now (and of course my $work) which mean my
time is severely limited so I'm doubtful right now if I could have it ready
in time as a talk. I might just post what I have gathered in a weeks time in
the hope I get a little more input in the mean time.



DBD::Informix has had a couple of UTF-8 patches sent and one has been
applied to the code.  The other arrived this morning and has to be
compared.  The attribute names chosen are different, but both contain 'ix_'
as a prefix and UTF-8 in some form or another.
I couldn't find a repository for Informix so I couldn't see what these 
changes were.

I've added your comments to my document - to be released tonight hopefully.


What I'm not sure about is how to test the code.  Creating an Informix
database that has UTF-8 data in it is trivial (well, nearly trivial).  The
difficulty is demonstrating where there were problems before and that the
problems are gone after.




If anyone has suggestions for how to show that UTF8 is working properly - in
the form of a fairly simple test case - I'd be very grateful to receive it
as guidance.

Attached to this post is some code I'm using however:

o it tries to use type_info_all to find relevant column types and so far 
DBD::SQLite is the only one not to support type_info_all. If you've got 
type_info_all support try and make it work. You'll need to look for the 
find_type calls (2 of them) and adjust to add informix SQL types. 
Alternatively, if you have not got type_info_all support you'll need to 
replicate what I did for DBD::SQLite - see start of script.


The code attempts to work around the problem that some DBDs do not 
report all possible columns back from type_info_all.


o few DBDs seem to support column_info_all - if you have not got it - 
that test is skipped


o you'll need an in.png PNG image which should preferably be smaller 
than the size of your blob column type. It does not matter what it is - 
any old small png will do.


o no drivers so far support unicode parameter markers - what a surprise

o some drivers have issues with blobs and unicode data in the same table 
- DBD::CSV notably


o some drivers cannot create unicode table names or tables with unicode 
column names e.g., DBD::mysql.


o no driver so far can do a table_info with a specified unicode (and 
existing) table name and return the table info. I suspect this is 
because the table_info and column_info calls are in XS using char * 
instead of Perl scalars.


If you make this work for Informix please send it back to me.

Same applies to any other DBD maintainers.

So, DBD::Informix is endeavouring to move forward, knowing that the
underlying database layers (ESQL/C on the client, and the Informix data
server) handle UTF-8 OK, so any problems are in persuading Perl (and perhaps
DBI) to handle it appropriately too.

[...Did I hear a chorus of "nice theory - shame about the practice"?...]


Martin

#
# Test unicode in a DBD - written for DBD::ODBC but should work for other
# DBDs if you change the column types at the start of this script.
# To run properly it needs an in.png PNG image file in the local working
# directory but 

Re: Add Unicode Support to the DBI

2011-11-04 Thread Martin J. Evans

On 03/11/11 23:25, David E. Wheeler wrote:

On Oct 7, 2011, at 5:06 PM, David E. Wheeler wrote:


Perhaps we could carve out some time at LPW to sit together and try to
progress this.


That would be awesome you guys!


So gents, do you plan to do this a bit? Martin, do you have the data you wanted 
to collect on this?

Thanks,

David



Sorry David, I've been snowed under. I will try very hard to publish the 
research I found this weekend.

I didn't think I was going to make LPW but it seems I will now - although it 
has cost me big time leaving it until the last minute.

Martin
--
Martin J. Evans
Easysoft Limited
http://www.easysoft.com


Re: Add Unicode Support to the DBI

2011-11-04 Thread David E. Wheeler
On Nov 4, 2011, at 1:39 AM, Martin J. Evans wrote:

 Sorry David, I've been snowed under. I will try very hard to publish the 
 research I found this weekend.

Awesome, thanks.

Did you ever get any data from DBD::SQLite folks?

 I didn't think I was going to make LPW but it seems I will now - although it 
 has cost me big time leaving it until the last minute.

Your sacrifice is greatly appreciated. :-)

Best,

David



Re: Add Unicode Support to the DBI

2011-11-04 Thread Martin J. Evans

On 04/11/11 16:39, David E. Wheeler wrote:

On Nov 4, 2011, at 1:39 AM, Martin J. Evans wrote:


Sorry David, I've been snowed under. I will try very hard to publish the 
research I found this weekend.


Awesome, thanks.

Did you ever get any data from DBD::SQLite folks?


Yes. I found a bug in the process and it was fixed but I have a working SQLite 
example.


I didn't think I was going to make LPW but it seems I will now - although it 
has cost me big time leaving it until the last minute.


Your sacrifice is greatly appreciated. :-)

Best,

David


I'm only really missing DB2 but I have contacts for that on #dbix-class who 
I've just not yet poked.

Martin
--
Martin J. Evans
Easysoft Limited
http://www.easysoft.com


Re: Add Unicode Support to the DBI

2011-11-04 Thread David E. Wheeler
On Nov 4, 2011, at 10:33 AM, Martin J. Evans wrote:

 Did you ever get any data from DBD::SQLite folks?
 
 Yes. I found a bug in the process and it was fixed but I have a working 
 SQLite example.

Oh, great.

 I'm only really missing DB2 but I have contacts for that on #dbix-class who 
 I've just not yet poked.

Cool, thanks for the update.

Best,

David



Re: Add Unicode Support to the DBI

2011-11-03 Thread David E. Wheeler
On Oct 7, 2011, at 5:06 PM, David E. Wheeler wrote:

 Perhaps we could carve out some time at LPW to sit together and try to
 progress this.
 
 That would be awesome you guys!

So gents, do you plan to do this a bit? Martin, do you have the data you wanted 
to collect on this?

Thanks,

David



Re: Add Unicode Support to the DBI

2011-10-13 Thread Greg Sabino Mullane

-BEGIN PGP SIGNED MESSAGE-
Hash: RIPEMD160


David E. Wheeler wrote:
 I think what I haven't said is that we should just use the same 
 names that Perl I/O uses. Er, well, for the :raw and :utf8 
 varieties I was, anyway. Perhaps we should adopt it wholesale, 
 so you'd use :encoding(UTF-8) instead of UTF-8.

That's pretty ugly. I don't think we need to adopt the I/O 
convention, as there is no direct mapping anyway, it just 
confuses the issue.

 For DBD::Pg, at least, if client-encoding is set to Big5, then 
 you *have* to encode to send it to the database. Or change the 
 client encoding, of course.

Not sure I'm following this completely. Or rather, why this should 
be the DBDs role.

 How would one map things - just demand that 
 whatever is given must be a literal encoding the particular database 
 can understand?

 I think we should standardize on the Perl IO names for these things. 
 Some databases may not support them all, of course.

Hm... I don't know enough about the various DB's encodings to see 
how good an idea that is.

 So the above means these two actually behave very differently:
 
 $dbh->{encoding} = ':utf8';
 
 $dbh->{encoding} = 'utf8';
 
 Could be a little confusing, no? Methinks we need some long ugly name, maybe 
 even worse than perl_native. Perhaps perl_internal_utf8_flag? 1/2 :)

 No, I think just encoding, and utf8 would be invalid, 
 but :encoding(UTF-8) would not.

Again, ugh. Although a *little* less confusing when contrasting:

$dbh->{encoding} = ':encoding(utf-8)';

$dbh->{encoding} = 'utf8';

 Well, I think we might have to have it with the pg_prefix until 
 this stuff is finalized here. Not sure, though.

That's my point - if we can get it finalized here, we can avoid the 
pg_prefix entirely, rather than add it now and then deprecate it later.

- -- 
Greg Sabino Mullane g...@turnstep.com
PGP Key: 0x14964AC8 201110130902
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-BEGIN PGP SIGNATURE-

iEYEAREDAAYFAk6W4ZQACgkQvJuQZxSWSsiqUQCgo/icUz0enqn0BWSygNSeNJGW
lDsAoMbjgZrsGJyS7kS60RgNNkpXMIjG
=43Q3
-END PGP SIGNATURE-




Re: Add Unicode Support to the DBI

2011-10-13 Thread David E. Wheeler
On Oct 13, 2011, at 6:03 AM, Greg Sabino Mullane wrote:

 I think what I haven't said is that we should just use the same 
 names that Perl I/O uses. Er, well, for the :raw and :utf8 
 varieties I was, anyway. Perhaps we should adopt it wholesale, 
 so you'd use :encoding(UTF-8) instead of UTF-8.
 
 That's pretty ugly. I don't think we need to adopt the I/O 
 convention, as there is no direct mapping anyway, it just 
 confuses the issue.

Sure. In that case, I'd say :utf8, :raw, or $encoding.

 For DBD::Pg, at least, if client-encoding is set to Big5, then 
 you *have* to encode to send it to the database. Or change the 
 client encoding, of course.
 
 Not sure I'm following this completely. Or rather, why this should 
 be the DBDs role.

By default, yes, the DBD should DTRT here. But I think there also ought to be a 
way to tell it what to do.

 How would one map things - just demand that 
 whatever is given must be a literal encoding the particular database 
 can understand?
 
 I think we should standardize on the Perl IO names for these things. 
 Some databases may not support them all, of course.
 
 Hm... I don't know enough about the various DB's encodings to see 
 how good an idea that is.

I assume that it's all over the map, so we should be as general as we can. 
Specifying an encoding by name should cover everything.

 No, I think just encoding, and utf8 would be invalid, 
 but :encoding(UTF-8) would not.
 
 Again, ugh. Although a *little* less confusing when contrasting:
 
 $dbh->{encoding} = ':encoding(utf-8)';
 
 $dbh->{encoding} = 'utf8';

Yeah, or we can go with my original suggestion:

$dbh->{encoding} = 'UTF-8';
$dbh->{encoding} = ':utf8';

 Well, I think we might have to have it with the pg_prefix until 
 this stuff is finalized here. Not sure, though.
 
 That's my point - if we can get it finalized here, we can avoid the 
 pg_prefix entirely, rather than add it now and then deprecate it later.

Sure. I suspect this is going to take a while, though.

Best,

David




Re: Add Unicode Support to the DBI

2011-10-07 Thread Tim Bunce
On Tue, Oct 04, 2011 at 11:24:51PM +0100, Martin J. Evans wrote:
 On 04/10/2011 22:38, Tim Bunce wrote:
 I've not had time to devote to this thread. Sorry.
 
 I'd be grateful if someone could post a summary of it if/when it
 approaches some kind of consensus.

 I don't think there is a kind of consensus right now (although
 some useful discussion which probably will bear fruit) and I'd
 prefer to work out what unicode support already exists and how it is
 implemented first. [...]
 
 I've started gathering together details of what unicode support
 there is in DBDs, how it is implemented and what special flags there
 are to support it. However, this is a massive task. So far I've done
 ODBC, Oracle, CSV, Unify, mysql, SQLite, Firebird

That seems like enough to make a start on design issues.
There's certainly a wide range of databases in that list :)

 examples I have which verify unicode support

Are the examples in a repo somewhere?

 I thought the whole issue was an interesting topic and I had toyed
 with doing a talk for LPW but to be honest, it is already taking a
 lot of time and I have personal issues right now (and of course my
 $work) which mean my time is severely limited so I'm doubtful right
 now if I could have it ready in time as a talk. I might just post
 what I have gathered in a weeks time in the hope I get a little more
 input in the mean time.

Good idea. I'd like to see the examples sometime to see what scope they
have and if/how the might be improved.

Perhaps we could carve out some time at LPW to sit together and try to
progress this.

Tim.


Re: Add Unicode Support to the DBI

2011-10-07 Thread David E. Wheeler
On Oct 7, 2011, at 1:47 AM, Tim Bunce wrote:

 Perhaps we could carve out some time at LPW to sit together and try to
 progress this.

That would be awesome you guys!

D



Re: Add Unicode Support to the DBI

2011-10-06 Thread Greg Sabino Mullane

-BEGIN PGP SIGNED MESSAGE-
Hash: RIPEMD160


 Uh, say what? Just as I need to

 binmode STDOUT, ':utf8';
 Before sending stuff to STDOUT (that is, turn off the flag), I would 
 expect DBDs to do the same before sending data to the database. 
 Unless, of course, it just works.

I cannot imagine the flag really matters or not. We (Pg) simply dump a 
bunch of chars to the database, and build it by slurping in the string 
character by character until we hit a null. I suppose other databases 
may do things differently, but I can't imagine how/why.

 Yes, very bad example. Let's call it utf8. Forget 'unicode' entirely.

 Yeah, better, though it just perpetuates Perl's unfortunate use of 
 the term "utf8" for internal string representation. Though I suppose 
 that ship has sunk already.

Yep. To paraphrase horribly, Perl's unicode support is the worst, except for 
all the other languages.

 Because it may still need to convert things. See the ODBC discussion.

 Oh, so you're saying it will decode and encode between Perl's internal 
 form and UTF-8, rather than just flip the flag on and off?

Yes, that's a possibility.

 Yes, because you were only talking about utf8 and UTF-8, not any 
 other encodings. Unless I missed something. If the data coming back 
 from the DB is Big5, I may well want to have some way to decode it 
 (and to encode it for write statements).

You mean at the DBD level - such that you can say to the database, 
"I don't care what encoding you stored it as, I want it encoded 
as X when you give it back to me"? (update: yes, see below)

 Well, because utf-8 is pretty much a defacto encoding, or at least 
 way, way more popular than things like ucs2. Also, the Perl utf8 
 flag encourages us to put everything into UTF-8.

 Yeah, but again, that might be some reason to call it something else, 
 like perl_native or something. The fact that it happens to be UTF-8 
 should be irrelevant. ER, except, I guess, you still have to know the 
 encoding of the database.

Well, I wouldn't call it irrelevant, but at the end of the day, we can 
call it perl_native, but that's just going to cause people to look it up 
in the docs and then say "aha! that means the utf8 flag is on" and then 
they have perl_native -> utf8 burned into their head. Or worse, 
perl_native -> unicode. :)

 * 'A': the default, it means the DBD should do the best thing, which in most 
 cases means setting SvUTF8_on if the data coming back is UTF-8.
 * 'B': (on). The DBD should make every effort to set SvUTF8_on for returned 
 data, even if it thinks it may not be UTF-8.
 * 'C': (off). The DBD should not call SvUTF8_on, regardless of what it 
 thinks the data is.

 I still prefer an encoding attribute that you can set as follows:

 * undef: Default; same as your A.
 * ':utf8': Same as your B:
 * ':raw': Same as your C
 * $encoding: Encode/decode to/from $encoding
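
A sketch of how a driver might dispatch on such an attribute when fetching a value. This is entirely hypothetical - neither the attribute nor decode_fetched() exists in DBI today, and Encode::_utf8_on is the (internal, discouraged) way to model "flag on, no decode" from pure Perl:

```perl
use strict;
use warnings;
use Encode qw(decode);

# $encoding is the proposed attribute value; $bytes the raw octets from
# the database. Hypothetical helper - not a real DBI/DBD API.
sub decode_fetched {
    my ($encoding, $bytes) = @_;
    return $bytes if !defined $encoding;    # undef: driver's best effort ('A')
    return $bytes if $encoding eq ':raw';   # ':raw': untouched octets ('C')
    if ($encoding eq ':utf8') {             # ':utf8': flag on, no validation ('B')
        Encode::_utf8_on($bytes);           # internal API, shown for illustration
        return $bytes;
    }
    return decode($encoding, $bytes);       # anything else: a real decode
}

# e.g. the smiley's UTF-8 octets decode back to one character:
my $char = decode_fetched('UTF-8', "\xE2\x98\xBA");
```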

I like that. Although the names are still odd. I guess it does map 
though: raw means no utf8 flag. Still not sure about the encode 
'to', but I'll start thinking about how we could implement the 
'from' in DBD::Pg. How would one map things - just demand that 
whatever is given must be a literal encoding the particular database 
can understand?

 With an encoding attribute, you don't need the utf8_flag at all.

Right, +1

So the above means these two actually behave very differently:

$dbh->{encoding} = ':utf8';

$dbh->{encoding} = 'utf8';

Could be a little confusing, no? Methinks we need some long ugly name, maybe 
even worse than perl_native. Perhaps perl_internal_utf8_flag? 1/2 :)

Thanks for plugging away at this. My short term goal is to get this finalized 
enough that I can release the next version of DBD::Pg without a 'pg_' prefix 
to control the encoding items.


- -- 
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201110061151
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-BEGIN PGP SIGNATURE-

iEYEAREDAAYFAk6Nz28ACgkQvJuQZxSWSsiWJQCgt/F0r/sCPDa9GuYrGZpZHlQ2
WfYAn0asIYHmPKz1BDfcBo7wLADHmH7N
=eJmk
-END PGP SIGNATURE-




Re: Add Unicode Support to the DBI

2011-10-05 Thread H.Merijn Brand
On Tue, 04 Oct 2011 23:24:51 +0100, Martin J. Evans
martin.ev...@easysoft.com wrote:

 Some might disagree but DB2 is a main 
 one I no longer have access to (please contact me if you use DBD::DB2 
 and are prepared to spare half an hour or so to modify examples I have 
 which verify unicode support). Of course, if you use another DBD and can 
 send me info on unicode support I'd love to hear from you.

There are two DB2 users on PerlMonks, who are rather helpful in testing
areas. Ask [Tanktalus] & [talexb] :)

$ grep -i -w db2  30 FreeNode-#cbstream.log | grep -i -e unicode -e
utf Sep 30 19:28:20 cbstream  [Tanktalus] /me contemplates how to get 
unicode strings back out of db2 ...
Feb 27 16:29:01 cbstream  [talexb] /me returns to fiddling with utf-8 in 
DB2.
Mar 06 17:56:26 cbstream  [talexb] [Tanktalus]: Hey, looks like I solved 
the utf-8 DB2 problem .. just increase the field to 3n from n and it seems to 
work.
Dec 16 21:19:11 cbstream  [Tanktalus] The DB does. Thus, I would expect 
that the DBD *could* request data in whatever encoding it wants. And since it 
would want utf8 to mesh with Perl properly, and every encoding that DB2 
supports has a mapping to utf8, this should be doable.
Dec 16 21:25:24 cbstream  [Tanktalus] [ambrus]: yeah, I would like to see 
it as mandatory for byte strings that are already utf-8 (which the DBD driver 
should be able to tell trivially, at least on DB2), or for 
connections/statements/something where said decoding was requested.

-- 
H.Merijn Brand  http://tux.nl  Perl Monger  http://amsterdam.pm.org/
using 5.00307 through 5.14 and porting perl5.15.x on HP-UX 10.20, 11.00,
11.11, 11.23 and 11.31, OpenSuSE 10.1, 11.0 .. 11.4 and AIX 5.2 and 5.3.
http://mirrors.develooper.com/hpux/   http://www.test-smoke.org/
http://qa.perl.org  http://www.goldmark.org/jeff/stupid-disclaimers/


Re: Add Unicode Support to the DBI

2011-10-04 Thread Tim Bunce
I've not had time to devote to this thread. Sorry.

I'd be grateful if someone could post a summary of it if/when it
approaches some kind of consensus.

Thanks.

Tim.


Re: Add Unicode Support to the DBI

2011-10-04 Thread Martin J. Evans

On 04/10/2011 22:38, Tim Bunce wrote:

I've not had time to devote to this thread. Sorry.

I'd be grateful if someone could post a summary of it if/when it
approaches some kind of consensus.

Thanks.

Tim.
I don't think there is a kind of consensus right now (although some 
useful discussion which probably will bear fruit) and I'd prefer to work 
out what unicode support already exists and how it is implemented first. 
For instance, Pg is very focussed on UTF-8 (as are most DBDs) and yet 
ODBC uses UCS2 under the hood and CSV can use anything you like. 
Greg/David/Postgres seem to have an immediate problem with unicode 
support in Postgres and I can imagine they are keen to resolve it and 
I'd suggest they do it now in the most appropriate way for DBD::Pg. I 
don't see why this should necessarily impact on any discussion as to 
what DBI should_do/should_say as already the DBDs which support unicode 
mostly do it in different ways.


I've started gathering together details of what unicode support there is 
in DBDs, how it is implemented and what special flags there are to 
support it. However, this is a massive task. So far I've done ODBC, 
Oracle, CSV, Unify, mysql, SQLite, Firebird and sort of held off on Pg 
as I knew Greg was working on it. Some might disagree but DB2 is a main 
one I no longer have access to (please contact me if you use DBD::DB2 
and are prepared to spare half an hour or so to modify examples I have 
which verify unicode support). Of course, if you use another DBD and can 
send me info on unicode support I'd love to hear from you.


I thought the whole issue was an interesting topic and I had toyed with 
doing a talk for LPW but to be honest, it is already taking a lot of 
time and I have personal issues right now (and of course my $work) which 
mean my time is severely limited so I'm doubtful right now if I could 
have it ready in time as a talk. I might just post what I have gathered 
in a weeks time in the hope I get a little more input in the mean time.


Martin


Re: Add Unicode Support to the DBI

2011-10-04 Thread Jonathan Leffler
On Tue, Oct 4, 2011 at 15:24, Martin J. Evans martin.ev...@easysoft.com wrote:

 On 04/10/2011 22:38, Tim Bunce wrote:

 I've not had time to devote to this thread. Sorry.

 I'd be grateful if someone could post a summary of it if/when it
 approaches some kind of consensus.

  I don't think there is a kind of consensus right now (although some
 useful discussion which probably will bear fruit) and I'd prefer to work out
 what unicode support already exists and how it is implemented first. For
 instance, Pg is very focussed on UTF-8 (as are most DBDs) and yet ODBC uses
 UCS2 under the hood and CSV can use anything you like. Greg/David/Postgres
 seem to have an immediate problem with unicode support in Postgres and I can
 imagine they are keen to resolve it and I'd suggest they do it now in the
 most appropriate way for DBD::Pg. I don't see why this should necessarily
 impact on any discussion as to what DBI should_do/should_say as already the
 DBDs which support unicode mostly do it in different ways.

 I've started gathering together details of what unicode support there is in
 DBDs, how it is implemented and what special flags there are to support it.
 However, this is a massive task. So far I've done ODBC, Oracle, CSV, Unify,
 mysql, SQLite, Firebird and sort of held off on Pg as I knew Greg was
 working on it. Some might disagree but DB2 is a main one I no longer have
 access to (please contact me if you use DBD::DB2 and are prepared to spare
 half an hour or so to modify examples I have which verify unicode support).
 Of course, if you use another DBD and can send me info on unicode support
 I'd love to hear from you.

 I thought the whole issue was an interesting topic and I had toyed with
 doing a talk for LPW but to be honest, it is already taking a lot of time
 and I have personal issues right now (and of course my $work) which mean my
 time is severely limited so I'm doubtful right now if I could have it ready
 in time as a talk. I might just post what I have gathered in a weeks time in
 the hope I get a little more input in the mean time.



DBD::Informix has had a couple of UTF-8 patches sent and one has been
applied to the code.  The other arrived this morning and has to be
compared.  The attribute names chosen are different, but both contain 'ix_'
as a prefix and UTF-8 in some form or another.

What I'm not sure about is how to test the code.  Creating an Informix
database that has UTF-8 data in it is trivial (well, nearly trivial).  The
difficulty is demonstrating where there were problems before and that the
problems are gone after.

If anyone has suggestions for how to show that UTF8 is working properly - in
the form of a fairly simple test case - I'd be very grateful to receive it
as guidance.

So, DBD::Informix is endeavouring to move forward, knowing that the
underlying database layers (ESQL/C on the client, and the Informix data
server) handle UTF-8 OK, so any problems are in persuading Perl (and perhaps
DBI) to handle it appropriately too.

[...Did I hear a chorus of "nice theory - shame about the practice"?...]

-- 
Jonathan Leffler jonathan.leff...@gmail.com  #include disclaimer.h
Guardian of DBD::Informix - v2011.0612 - http://dbi.perl.org
Blessed are we who can laugh at ourselves, for we shall never cease to be
amused.


Re: Add Unicode Support to the DBI

2011-10-03 Thread David E . Wheeler
On Oct 2, 2011, at 8:49 PM, Greg Sabino Mullane wrote:

 DEW I assume you also mean to say that data sent *to* the database 
 DEW has the flag turned off, yes?
 
 No: that is undefined. I don't see it as the DBDs job to massage data 
 going into the database. Or at least, I cannot imagine a DBI interface 
 for that.

Uh, say what? Just as I need to

   binmode STDOUT, ':utf8';

Before sending stuff to STDOUT (that is, turn off the flag), I would expect 
DBDs to do the same before sending data to the database. Unless, of course, it 
just works.

 DEW Yeah, maybe should be utf8_flag instead.
 
 Yes, very bad example. Let's call it utf8. Forget 'unicode' entirely.

Yeah, better, though it just perpetuates Perl's unfortunate use of the term 
"utf8" for internal string representation. Though I suppose that ship has 
sunk already.

 Yeah, that last one is the current Postgres plan. Which I think should 
 be best practice and a default DBI expectation.

Agreed.

 DEW DBDs will decode the data as needed.
 DEW I don't understand this sentence. If the flag is 
 DEW flipped, why will it decode?
 
 Because it may still need to convert things. See the ODBC discussion.

Oh, so you're saying it will decode and encode between Perl's internal form and 
UTF-8, rather than just flip the flag on and off?

 GSM If this is set off, the utf8 flag will never be set, and no 
 GSM decoding will be done on data coming back from the database.
 
 DEW What if the data coming back from the database 
 DEW is Big5 and I want to decode it?
 
 Eh? You just asked above why would we ever decode it?

Yes, because you were only talking about utf8 and UTF-8, not any other 
encodings. Unless I missed something. If the data coming back from the DB is 
Big5, I may well want to have some way to decode it (and to encode it for write 
statements).

 DEW You mean never allow it to be flipped when the 
 DEW database encoding is SQL_ASCII?
 
 Yes, basically. But perhaps it does not matter too much. SQL_ASCII 
 is such a bad idea anyway, I feel no need to coddle people using it. :)

+1

 MJE So is the problem that sometimes a DBD does not know what encoding to use 
 MJE for data being sent to the database, or how/whether to decode data coming 
 MJE back from the database? and if that is the case do we need some settings 
 MJE in DBI to tell a DBD?
 
 I think that's one of the things that is being argued for, here.

Yes.

 MJE I think this was my point above, i.e., why utf8? databases accept and 
 MJE supply a number of encodings so why have a flag called utf8? are we 
 MJE going to have ucs2, utf16, utf32 flags as well. Surely, it makes more 
 MJE sense to have a flag where you can set the encoding in the same form 
 MJE Encode uses.
 
 Well, because utf-8 is pretty much a de facto encoding, or at least way, way 
 more popular than things like ucs2. Also, the Perl utf8 flag encourages 
 us to put everything into UTF-8.

Yeah, but again, that might be some reason to call it something else, like 
perl_native or something. The fact that it happens to be UTF-8 should be 
irrelevant. Er, except, I guess, you still have to know the encoding of the 
database.

 MJE and what about when the DBD knows you are wrong because the database 
 MJE says it is returning data in encoding X but you ask for Y.
 
 I would assume that the DBD should attempt to convert it to Y if that 
 is what the user wants.

And throw exceptions as appropriate (encoding/decoding failure).

 MJE (examples of DBD flags)
 
 Almost all the examples from DBDs seem to be focusing on the SvUTF8 flag, so 
 perhaps we should start by focusing on that, or at least decoupling that 
 entirely from decoding? If we assume that the default DBI behavior, or more 
 specifically the default behavior for a random DBD someone picks up is 
 "flip the flag on if the data is known to be UTF-8", then we can propose a 
 DBI attribute, call it utf8_flag, that has three states:
 
 * 'A': the default, it means the DBD should do the best thing, which in most 
 cases means setting SvUTF8_on if the data coming back is UTF-8.
 * 'B': (on). The DBD should make every effort to set SvUTF8_on for returned 
 data, even if it thinks it may not be UTF-8.
 * 'C': (off). The DBD should not call SvUTF8_on, regardless of what it 
 thinks the data is.

I still prefer an encoding attribute that you can set as follows:

* undef: Default; same as your A.
* ':utf8': Same as your B.
* ':raw': Same as your C.
* $encoding: Encode/decode to/from $encoding
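Roughly, a driver could map those four settings onto Encode like this. The attribute and the `fetch_filter` helper are hypothetical (a sketch of the proposal, not any driver's actual code); only the Encode calls are real:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical per-value filter keyed on the proposed $h->{encoding}.
sub fetch_filter {
    my ($bytes, $encoding) = @_;
    return $bytes unless defined $encoding;       # undef: driver's default
    return $bytes if $encoding eq ':raw';         # hands off entirely
    if ($encoding eq ':utf8') {                   # flip the flag, no checks
        Encode::_utf8_on($bytes);
        return $bytes;
    }
    # anything else is an Encode name: strict decode, die on bad data
    return Encode::decode($encoding, $bytes, Encode::FB_CROAK);
}

my $s = fetch_filter("caf\xC3\xA9", 'UTF-8');
print length($s), "\n";    # 4 characters once decoded
```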

 I presume the other half would be an encoding, such that
 $h->{encoding} would basically ask the DBD to make any returned 
 data into that encoding, by hook or by crook.

With an encoding attribute, you don't need the utf8_flag at all.

Best,

David



Re: Add Unicode Support to the DBI

2011-10-02 Thread Greg Sabino Mullane


From: David E. Wheeler da...@kineticode.com
GSM * $h->{unicode_flag}
GSM If this is set on, data returned from the database is assumed to be 
UTF-8, and 
GSM the utf8 flag will be set.

DEW I assume you also mean to say that data sent *to* the database 
DEW has the flag turned off, yes?

No: that is undefined. I don't see it as the DBDs job to massage data 
going into the database. Or at least, I cannot imagine a DBI interface 
for that.

From: Martin J. Evans martin.ev...@easysoft.com
MJE There is more than one way to encode unicode - not everyone uses 
MJE UTF-8; although some encodings don't support all of unicode.

Right, but I'm talking utf8 here. There are only two things that 
can be done with the strings returned from a database: flip the 
utf8 flag, or convert/decode it to something else. If it's anything 
but utf-8, the utf-8 flag is useless at best, harmful at worst.
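Those two options can be put side by side with core Encode. This is purely an illustration of the distinction, not driver code:

```perl
use strict;
use warnings;
use Encode ();

my $good = "\xC3\xA9";    # "é" as UTF-8 octets, as a driver might fetch it
my $bad  = "\xC3\x28";    # malformed: the continuation byte is wrong

# A real decode validates the bytes and sets the flag.
my $str = Encode::decode('UTF-8', my $g = $good, Encode::FB_CROAK);

# Merely flipping the flag is fast but does no validation, so broken
# bytes sail through silently.
my $copy = $bad;
Encode::_utf8_on($copy);

# The malformed bytes only blow up under a strict decode:
my $ok = eval { Encode::decode('UTF-8', my $b = $bad, Encode::FB_CROAK); 1 };
print $ok ? "valid\n" : "invalid\n";    # prints "invalid"
```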

DEW Yeah, maybe it should be utf8_flag instead.

Yes, very bad example. Let's call it utf8. Forget 'unicode' entirely.

MJE 4) don't decode the data, the DBD knows it is say UTF-8 encoded 
MJE and simply sets the UTF-8 flag (which from what I read is horribly 
MJE flawed but seems to work for me).

Yeah, that last one is the current Postgres plan. Which I think should 
be best practice and a default DBI expectation.

DEW DBDs will decode the data as needed.
DEW I don't understand this sentence. If the flag is 
DEW flipped, why will it decode?

Because it may still need to convert things. See the ODBC discussion.

GSM If this is set off, the utf8 flag will never be set, and no 
GSM decoding will be done on data coming back from the database.

DEW What if the data coming back from the database 
DEW is Big5 and I want to decode it?

Eh? You just asked above why would we ever decode it?

DEW You mean never allow it to be flipped when the 
DEW database encoding is SQL_ASCII?

Yes, basically. But perhaps it does not matter too much. SQL_ASCII 
is such a bad idea anyway, I feel no need to coddle people using it. :)

MJE So is the problem that sometimes a DBD does not know what encoding to use 
MJE for data being sent to the database, or how/whether to decode data coming 
MJE back from the database? and if that is the case do we need some settings 
MJE in DBI to tell a DBD?

I think that's one of the things that is being argued for, here.

MJE I think this was my point above, i.e., why utf8? databases accept and 
MJE supply a number of encodings so why have a flag called utf8? are we 
MJE going to have ucs2, utf16, utf32 flags as well. Surely, it makes more 
MJE sense to have a flag where you can set the encoding in the same form 
MJE Encode uses.

Well, because utf-8 is pretty much a de facto encoding, or at least way, way 
more popular than things like ucs2. Also, the Perl utf8 flag encourages 
us to put everything into UTF-8.

MJE and what about when the DBD knows you are wrong because the database 
MJE says it is returning data in encoding X but you ask for Y.

I would assume that the DBD should attempt to convert it to Y if that 
is what the user wants.

MJE DBD::Oracle to my knowledge has no special flags; it just attempts to do 
MJE the right thing but it favours speed so most data that is supposed to be 
MJE UTF-8 encoded has SvUTF8_on set but in one case (error messages) it 
MJE properly and strictly decodes the message so long as your Perl is recent 
MJE enough else it uses SvUTF8_on.

I'm not sure I understand this. It takes UTF-8 errors from the database, 
changes them to something else, and does NOT set SvUTF8?

MJE (examples of DBD flags)

Almost all the examples from DBDs seem to be focusing on the SvUTF8 flag, so 
perhaps we should start by focusing on that, or at least decoupling that 
entirely from decoding? If we assume that the default DBI behavior, or more 
specifically the default behavior for a random DBD someone picks up is 
"flip the flag on if the data is known to be UTF-8", then we can propose a 
DBI attribute, call it utf8_flag, that has three states:

* 'A': the default, it means the DBD should do the best thing, which in most 
cases means setting SvUTF8_on if the data coming back is UTF-8.
* 'B': (on). The DBD should make every effort to set SvUTF8_on for returned 
data, even if it thinks it may not be UTF-8.
* 'C': (off). The DBD should not call SvUTF8_on, regardless of what it 
thinks the data is.
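A minimal sketch of how a driver might honour the three proposed states. The attribute name and the 'A'/'B'/'C' labels come from the proposal above; the `apply_utf8_flag` helper itself is hypothetical:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper a DBD could run over each fetched value.
# $mode is the proposed utf8_flag: 'A' (default), 'B' (on), 'C' (off).
sub apply_utf8_flag {
    my ($value, $mode, $looks_like_utf8) = @_;
    if    ($mode eq 'B') { Encode::_utf8_on($value)  }        # force the flag on
    elsif ($mode eq 'C') { Encode::_utf8_off($value) }        # never set it
    else  { Encode::_utf8_on($value) if $looks_like_utf8 }    # 'A': best effort
    return $value;
}

my $flagged = apply_utf8_flag("caf\xC3\xA9", 'A', 1);
print utf8::is_utf8($flagged) ? "utf8 flag on\n" : "utf8 flag off\n";
```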

I presume the other half would be an encoding, such that
$h->{encoding} would basically ask the DBD to make any returned 
data into that encoding, by hook or by crook.

- -- 
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201110022345
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8




Re: Add Unicode Support to the DBI

2011-09-22 Thread Martin J. Evans

On 21/09/11 21:52, Greg Sabino Mullane wrote:


...

And maybe that's the default. But I should be able to tell it to be pedantic 
when the data is known to be bad (see, for example, data from an 
SQL_ASCII-encoded PostgreSQL database).

...

DBD::Pg's approach is currently broken. Greg is working on fixing it, but for 
compatibility
reasons the fix is non-trivial (and the API might be, too). In a perfect world 
DBD::Pg would
just always do the right thing, as the database tells it what encodings to use 
when you
connect (and *all* data is encoded as such, not just certain data types). But 
the world is
not perfect, there's a lot of legacy stuff.

Greg, care to add any other details?


My thinking on this has changed a bit. See the DBD::Pg in git head for a 
sample, but basically,
DBD::Pg is going to:

* Flip the flag on if the client_encoding is UTF-8 (and server_encoding is not 
SQL_ASCII)
* Flip if off if not

The single switch will be pg_unicode_flag, which will basically override the 
automatic
choice above, just in case you really want your SQL_ASCII byte soup marked as 
utf8 for
some reason, or (more likely), you want your data unmarked as utf8 despite 
being so.

This does rely on PostgreSQL doing the right thing when it comes to 
encoding/decoding/storing
all the encodings, but I'm pretty sure it's doing well in that regard.

...

Since nobody has actually defined a specific interface yet, let me throw out a
straw man. It may look familiar :)

===
* $h->{unicode_flag}

If this is set on, data returned from the database is assumed to be UTF-8, and
the utf8 flag will be set. DBDs will decode the data as needed.


There is more than one way to encode unicode - not everyone uses UTF-8; 
although some encodings don't support all of unicode.

unicode is not encoded as UTF-8 in ODBC using the wide APIs.

Using the wide ODBC APIs returns data in UCS2 encoding and DBD::ODBC decodes 
it. Using the ANSI APIs data is returned as octets and is whatever it is - it 
may be ASCII, it may be UTF-8 encoded (only in 2 cases I know and I believe 
they are flawed anyway) it may be something else in which case the application 
needs to know what it is. In the case of octets which are UTF-8 encoded 
DBD::ODBC has no idea that is the case unless you tell it and it will then set 
the UTF-8 flag (but see later).
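What the wide APIs imply is essentially the UCS-2 round trip that Encode already provides. A standalone sketch (not DBD::ODBC's actual code):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Simulate a wide-API buffer: each character is two bytes, UCS-2LE.
my $wide = encode('UCS-2LE', "caf\x{E9}");
printf "buffer is %d bytes\n", length($wide);    # 8 bytes for 4 characters

# The driver decodes it back into a proper Perl string.
my $str = decode('UCS-2LE', $wide);
printf "string is %d characters\n", length($str);
```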


If this is set off, the utf8 flag will never be set, and no decoding will be 
done
on data coming back from the database.

If this is not set (undefined), the underlying DBD is responsible for doing the
correct thing. In other words, the behaviour is undefined.
===

I don't think this will fit into DBD::Pg's current implementation perfectly, as
we wouldn't want people to simply leave $h->{unicode_flag} on, as that would
force SQL_ASCII text to have utf8 flipped on. Perhaps we simply never, ever
allow that.


I'm not that familiar with Postgres (I've used it a few times and not to any great 
degree) and I used MySQL for a while years ago. I occasionally use SQLite. I do 
use DBD::Oracle and DBD::ODBC all the time. I'm still struggling to see the 
problem that needs fixing. Is it just that some people would like a DBI flag 
which tells the DBD:

1) decode any data coming back from the database strictly such that if it is 
invalid you die
2) decode any data coming back from the database loosely (e.g., utf-8 vs UTF-8)
3) don't decode the data from the database at all
4) don't decode the data, the DBD knows it is say UTF-8 encoded and simply sets 
the UTF-8 flag (which from what I read is horribly flawed but seems to work for 
me).

and the reverse.
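"The reverse" is the usual Encode round trip a driver would do before handing octets to the client library (illustrative only, not any DBD's actual code):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $param  = "na\x{EF}ve";              # a decoded Perl string ("naïve")
my $octets = encode('UTF-8', $param);   # what actually goes on the wire

printf "%d chars -> %d octets\n", length($param), length($octets);
# 5 chars -> 6 octets, and the encoded result carries no utf8 flag
```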

DBD::Oracle does 1 some of the time and it does 4 the rest of the time e.g. 
error messages are fully decoded from UTF-8 IF Oracle is sending UTF-8 and it 
does 4 on most of the column data IF Oracle is sending UTF-8.

DBD::ODBC does nothing via the ANSI APIs unless the odbc_utf8 flag is turned on 
in which case it does 4 (and it only does this because there is apparently a 
version of the Postgres ODBC driver out there somewhere that returns UTF-8 
encoded data, I've never seen it, I just accepted the patch).

DBD::ODBC does 1 if using the wide APIs and it has little choice since no one 
would want to accept UCS2 and have to decode it all the time.

My point being, doesn't the DBD know how the data is encoded when it gets it 
from the database? and it would hopefully know what the database needs when 
sending data. Perhaps in some conditions the DBD does not know this and needs 
to be told (I could imagine SQLite reading/writing straight to files for 
instance might want to know to open the file with UTF-8 layer).
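The SQLite aside boils down to PerlIO layers; here is the difference between writing through a UTF-8 layer and reading the same file back raw (a standalone illustration, no SQLite involved):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh, ':encoding(UTF-8)';        # the "UTF-8 layer"
print {$fh} "r\x{E9}sum\x{E9}\n";       # 7 characters, two of them non-ASCII
close $fh;

open my $raw, '<:raw', $path or die $!;
read $raw, my $bytes, 1024;
close $raw;
printf "%d bytes on disk\n", length($bytes);  # 9 on a Unix build: each é is 2 bytes
```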

So is the problem that sometimes a DBD does not know what encoding to use for 
data being sent to the database, or how/whether to decode data coming back from 
the database? and if that is the case do we need some settings in DBI to tell a DBD?

Martin
--
Martin J. Evans
Easysoft Limited
http://www.easysoft.com


Re: Add Unicode Support to the DBI

2011-09-22 Thread Martin J. Evans

David, I forgot to answer your post first and ended up putting most of my 
comments in a reply to Greg's posting - sorry, it was a long night last night. 
Some further comments below:

On 21/09/11 19:44, David E. Wheeler wrote:

On Sep 10, 2011, at 3:08 AM, Martin J. Evans wrote:


I'm not sure any change is required to DBI to support unicode. As
far as I'm aware unicode already works with DBI if the DBDs do the
right thing.


Right, but the problem is that, IME, none of them do the right
thing. As I said, I've submitted encoding-related bug reports for
every DBD I've used in production code. And they all have different
interfaces for tweaking things.


So it is these different interfaces for tweaking things you'd like to 
standardise I presume.


If you stick to the rule that all data Perl receives must be
decoded and all data Perl exports must be encoded it works
(ignoring any issues in Perl itself).


Er, was there supposed to be a ", then …" statement there?


I bow to Tom's experience but I'm still not sure how that applies
to DBI so long as the interface between the database and Perl
always encodes and decodes then the issues Tom describes are all
Perl ones - no?


The trouble is that:

1. They don't always encode or decode
2. When they do, they tend to get subtle bits wrong
3. And they all have different interfaces and philosophies for doing so


Surely Oracle should return the data encoded as you asked for it
and if it did not Oracle is broken. I'd still like to see this case
and then we can see if Oracle is broken and if there is a fix for
it.


Oh I don't doubt that Oracle is broken.


In some places DBD::Oracle does sv_utf8_decode(scalar) or
SvUTF8_on(scalar) (depending on your Perl) and in some places it
just does SvUTF8_on(scalar). I believe the latter is much quicker
as the data is not checked. Many people (myself included) are
particularly interested in DBD::Oracle being fast and if all the
occurrences were changed to decode I'd patch that out in my copy as
I know the data I receive is UTF-8 encoded.


IME it needs an "assume Oracle is broken" knob. That is, I should
have the option to enforce encoding and decoding, rather than just
flipping SvUTF8. And I think that such an interface should be
standardized in the DBI along with detailed information for driver
authors on how to get it right.


ok, I get that.


See above. I'd like the chance to go with speed and take the
consequences rather than go slower but know that incorrect UTF-8 is
spotted.


And maybe that's the default. But I should be able to tell it to be
pedantic when the data is known to be bad (see, for example data from
an SQL_ASCII-encoded PostgreSQL database).


I thought UTF-8 when used in Perl used the strict definition and
utf-8 used Perl's looser definition - see
http://search.cpan.org/~dankogai/Encode-2.44/Encode.pm#UTF-8_vs._utf8_vs._UTF8





That's right. So if I want to ensure that I'm getting strict encoding
in my database, it needs to encode and decode, not simply flip
SvUTF8.


Don't DBDs do this now? I know the encoding of the data I receive
in DBD::ODBC and decode it when I get it and encode it when I send
it and I believe that is what DBD::Oracle does as well. There is
one exception in ODBC for drivers which don't truly abide by ODBC
spec and send 8 bit data back UTF-8 encoded (see later).


There is no single API for configuring this in the DBI, and I argue
there should be.


I've spent a lot of effort getting unicode working in DBD::ODBC
(for UNIX and with patches from Alexander Foken for Windows) which
is implemented in an awkward fashion in ODBC. I'd like to hear from
DBD authors what support they already have and how it is
implemented so we can see what ground is already covered and where
the problems were.


DBD::Pg's approach is currently broken. Greg is working on fixing it,
but for compatibility reasons the fix is non-trivial (and the API
might be, too). In a perfect world DBD::Pg would just always do the
right thing, as the database tells it what encodings to use when you
connect (and *all* data is encoded as such, not just certain data
types). But the world is not perfect, there's a lot of legacy stuff.

Greg, care to add any other details?


as I remain unconvinced a problem exists other than incorrectly
coded DBDs. I'm happy to collate that information. As a start I'll
describe the DBD::ODBC:

1. ODBC has 2 sets of APIs, SQLxxxA (each chr is 8 bits) and
SQLxxxW (each chr is 16 bits and UCS-2). This is how Microsoft did
it and yes I know that does not support all of unicode but code
pages get involved too.

2. You select which API you are using with a macro when you compile
your application so you cannot change your mind. You can in theory
call SQLxxxA or SQLxxxW functions directly but if you use SQLxxx
you get the A or W depending on what the macro is set to. Problem:
DBD::ODBC has to be built one way or the other.

3. When using the SQLxxxA functions you can still bind
columns/parameters as wide 

Re: Add Unicode Support to the DBI

2011-09-22 Thread Martin J. Evans

On 22/09/2011 17:36, David E. Wheeler wrote:

On Sep 22, 2011, at 2:26 AM, Martin J. Evans wrote:


There is more than one way to encode unicode - not everyone uses UTF-8; 
although some encodings don't support all of unicode.

Yeah, maybe it should be utf8_flag instead.

see below.

unicode is not encoded as UTF-8 in ODBC using the wide APIs.

Using the wide ODBC APIs returns data in UCS2 encoding and DBD::ODBC decodes 
it. Using the ANSI APIs data is returned as octets and is whatever it is - it 
may be ASCII, it may be UTF-8 encoded (only in 2 cases I know and I believe 
they are flawed anyway) it may be something else in which case the application 
needs to know what it is. In the case of octets which are UTF-8 encoded 
DBD::ODBC has no idea that is the case unless you tell it and it will then set 
the UTF-8 flag (but see later).

Right. There needs to be a way to tell the DBI what encoding the server sends 
and expects to be sent. If it's not UTF-8, then the utf8_flag option is kind of 
useless.

I think this was my point above, i.e., why utf8? databases accept and 
supply a number of encodings so why have a flag called utf8? are we 
going to have ucs2, utf16, utf32 flags as well. Surely, it makes more 
sense to have a flag where you can set the encoding in the same form 
Encode uses.

I'm not that familiar with Postgres (I've used it a few times and not to any great 
degree) and I used MySQL for a while years ago. I occasionally use SQLite. I do 
use DBD::Oracle and DBD::ODBC all the time. I'm still struggling to see the 
problem that needs fixing. Is it just that some people would like a DBI flag 
which tells the DBD:

1) decode any data coming back from the database strictly such that if it is 
invalid you die
2) decode any data coming back from the database loosely (e.g., utf-8 vs UTF-8)
3) don't decode the data from the database at all
4) don't decode the data, the DBD knows it is say UTF-8 encoded and simply sets 
the UTF-8 flag (which from what I read is horribly flawed but seems to work for 
me).

and the reverse.

Yes, with one API for all drivers, if possible, and guidelines for how it 
should work (when to encode and decode, what to encode and decode, when to just 
flip the utf8 flag on and off, etc.).

ok

DBD::Oracle does 1 some of the time and it does 4 the rest of the time e.g. 
error messages are fully decoded from UTF-8 IF Oracle is sending UTF-8 and it 
does 4 on most of the column data IF Oracle is sending UTF-8.

Yeah, but to enable it *you set a bloody environment variable*. WHAT?

Unless I'm mistaken as to what you refer to I believe that is a feature 
of the Oracle client libraries and not one of DBD::Oracle so there is 
little we can do about that.



My point being, doesn't the DBD know how the data is encoded when it gets it 
from the database? and it would hopefully know what the database needs when 
sending data. Perhaps in some conditions the DBD does not know this and needs 
to be told (I could imagine SQLite reading/writing straight to files for 
instance might want to know to open the file with UTF-8 layer).

Or to turn it off, so you can just pass the encoded UTF-8 through to the file 
without the decode/encode round-trip.


So is the problem that sometimes a DBD does not know what encoding to use for 
data being sent to the database, or how/whether to decode data coming back from 
the database? and if that is the case do we need some settings in DBI to tell a DBD?

That's an issue, yes, but the main issue is that all the drivers do it 
differently, sometimes with different semantics, and lack all the functionality 
one might want (e.g., your examples 1-4).

Best,

David



So to try and move forward, we'd be talking about a flag or flags which say:

1 encode the data sent to the database like this (which could be nothing)
2 decode the data retrieved from the database like this (which could 
be nothing but if not nothing it could be using strict or loose for the 
UTF-8 and utf-8 case)
3 don't decode but use SvUTF8_on (a specific case since Perl uses that 
internally and a number of databases return UTF-8) - one that seems to work 
but I worry about.
4 do what the DBD thinks is best - whatever the behaviour is now?

and what about when it conflicts with your locale/LANG?

and what about PERL_UNICODE flags, do they come into this?

and what about when the DBD knows you are wrong because the database 
says it is returning data in encoding X but you ask for Y.


and for DBD::ODBC built for unicode API am I expected to try and decode 
UCS2 as x just because the flag tells me to and I know it will not work? 
Seems like it only applies to the ANSI API in DBD::ODBC where the data 
could be UTF-8 encoded in a few (possibly broken see 
http://www.martin-evans.me.uk/node/20#unicode) cases.


I still think it would help to name some specific cases per DBD of flags 
in use and why they exist:


DBD::ODBC has a odbc_utf8_on flag to say that data returned by the 
database when using the ANSI APIs is 

Re: Add Unicode Support to the DBI

2011-09-22 Thread David E. Wheeler
On Sep 22, 2011, at 11:14 AM, Martin J. Evans wrote:

 Right. There needs to be a way to tell the DBI what encoding the server 
 sends and expects to be sent. If it's not UTF-8, then the utf8_flag option 
 is kind of useless.
 I think this was my point above, i.e., why utf8? databases accept and supply 
 a number of encodings so why have a flag called utf8? are we going to have 
 ucs2, utf16, utf32 flags as well. Surely, it makes more sense to have a flag 
 where you can set the encoding in the same form Encode uses.

Yes, I agreed with you. :-)

 Unless I'm mistaken as to what you refer to I believe that is a feature of 
 the Oracle client libraries and not one of DBD::Oracle so there is little we 
 can do about that.

Sure you can. I set something via the DBI interface and the DBD sets the 
environment variable for the Oracle client libraries.

 So to try and move forward, we'd be talking about a flag or flags which say:
 
 1 encode the data sent to the database like this (which could be nothing)
 2 decode the data retrieved from the database like this (which could be 
 nothing but if not nothing it could be using strict or loose for the UTF-8 
 and utf-8 case)
 3 don't decode but use SvUTF8_on (a specific case since Perl uses that 
 internally and a number of databases return UTF-8) - one that seems to work 
 but I worry about.
 4 do what the DBD thinks is best - whatever the behaviour is now?

Yes.

 and what about when it conflicts with your locale/LANG?

So what?

 and what about PERL_UNICODE flags, do they come into this?

What are those?

 and what about when the DBD knows you are wrong because the database says it 
 is returning data in encoding X but you ask for Y.

Throw an exception or a warning.

 and for DBD::ODBC built for unicode API am I expected to try and decode UCS2 
 as x just because the flag tells me to and I know it will not work? Seems 
 like it only applies to the ANSI API in DBD::ODBC where the data could be 
 UTF-8 encoded in a few (possibly broken see 
 http://www.martin-evans.me.uk/node/20#unicode) cases.

If the user does something that makes no sense, tell them it makes no sense. 
Die if necessary.

 I still think it would help to name some specific cases per DBD of flags in 
 use and why they exist:
 
 DBD::ODBC has a odbc_utf8_on flag to say that data returned by the database 
 when using the ANSI APIs is UTF-8 encoded and currently it calls SvUTF8_on on 
 that data (I've never used or verified it works myself but the person 
 supplying the patch said it had a purpose with a particular Postgres based 
 database).

That's what the new DBD::Pg flag that Greg's working on does, too.

 Beyond that DBD::ODBC has no other flags as it knows in the unicode/wide APIs 
 the data is UCS2 encoded and it checks it is valid when decoding it. 
 Similarly when sending data to the database in the wide APIs it takes the 
 Perl scalar and encodes it in UCS2.

Yeah, ideally, by default, if the DBD knows the encoding used by the database, 
it should just DTRT. There are backward compatibility issues with that for 
DBD::Pg, though. So there probably should be a knob to say don't do any 
encoding or decoding at all, because a lot of older apps likely expect that.

 DBD::Oracle to my knowledge has no special flags; it just attempts to do the 
 right thing but it favours speed so most data that is supposed to be UTF-8 
 encoded has SvUTF8_on set but in one case (error messages) it properly and 
 strictly decodes the message so long as your Perl is recent enough else it 
 uses SvUTF8_on.
 
 So, what are the other flags in use and what purpose do they fulfill.

I think we could really just start with one flag, encoding. By default the 
DBD should just try to do the right thing. If encoding is set to :raw then 
it should do no encoding or decoding. If it's set to :utf8 it should just 
turn the flag on or off. If it's set to an actual encoding it should encode and 
decode. I think that would be a good start.

Best,

David






Re: Add Unicode Support to the DBI

2011-09-22 Thread David E. Wheeler
On Sep 22, 2011, at 11:57 AM, Martin J. Evans wrote:

 ok except what the oracle client libraries accept does not match with Encode 
 accepted strings so someone would have to come up with some sort of mapping 
 between the two.

Yes. That's one of the consequences of providing a single interface to multiple 
databases.

 and what about when it conflicts with your locale/LANG?
 So what?
 I'm not so sure this is a "So what" as Perl itself uses locale settings in 
 some cases - just thought it needed mentioning for consideration.

I'm not really concerned about locales at this point. I tend to leave 
collation, for example, up to the database. Right now I'm strictly concerned 
about encoding.

 and what about PERL_UNICODE flags, do they come into this?
 What are those?
 See http://perldoc.perl.org/perlrun.html
 
 In particular UTF-8 is the default PerlIO layer for input streams of which 
 reading data from a database could be considered one?

That'd be cool, but it's not currently implemented that way, obviously. DBI and 
PerlIO are completely independent AFAIK, and the DBI doesn't look like a file 
handle.

 ok, I'm thinking through the ramifications of this.
 
 To add to the list I see DBD::SQLite has sqlite_unicode: "strings coming 
 from the database and passed to the collation function will be properly 
 tagged with the utf8 flag; but this only works if the sqlite_unicode 
 attribute is set before the first call to a perl collation sequence" and "The 
 current FTS3 implementation in SQLite is far from complete with respect to 
 utf8 handling: in particular, variable-length characters are not treated 
 correctly by the builtin functions offsets() and snippet()".
 
 and DBD::CSV has
 
 f_encoding = utf8,
 
 DBD::mysql has mysql_enable_utf8, which apparently "determines whether 
 DBD::mysql should assume strings stored in the database are utf8. This 
 feature defaults to off."
 
 I could not find any special flags for DBD::DB2.
 
 DBD::Sybase has syb_enable_utf8 If this attribute is set then DBD::Sybase 
 will convert UNIVARCHAR, UNICHAR, and UNITEXT data to Perl's internal utf-8 
 encoding when they are retrieved. Updating a unicode column will cause Sybase 
 to convert any incoming data from utf-8 to its internal utf-16 encoding.


Yeah, so I think that can be generalized.

Best,

David




Re: Add Unicode Support to the DBI

2011-09-21 Thread David E. Wheeler
DBI peeps,

Sorry for the delayed response, I've been busy, looking to reply to this thread 
now.

On Sep 9, 2011, at 8:06 PM, Greg Sabino Mullane wrote:

 One thing I see bandied about a lot is that Perl 5.14 is highly preferred. 
 However, it's not clear exactly what the gains are and how bad 5.12 is 
 compared to 5.14, how bad 5.10 is, how bad 5.8 is, etc. Right now 5.8 is 
 the required minimum for DBI: should we consider bumping this? I know TC 
 would be horrified to see us attempting to talk about Unicode support 
 with a 5.8.1 requirement, but how much of that will affect database 
 drivers? I have no idea myself.

I think I'd just follow TC's recommendations here. DBI should stay compatible 
as far back as is reasonable without unduly affecting further development and 
improvement (not that there's much of that right now). So if proper encoding is 
important to you, use at least 5.12 and prefer 5.14. And if proper encoding is 
not important to you, well, it is, you just don't know it yet.

 Another aspect to think about that came up during some offline DBD::Pg 
 talks was the need to support legacy scripts and legacy data. While the 
 *correct* thing is to blaze forward and use Do Things Correctly everywhere, 
 I think we at least need some prominent knobs so that we can maintain 
 backwards compatibility for existing scripts that expect a bunch of 
 Latin1, or need the data to come back in the current, undecoded, 
 un-utf8-flagged way.

Agreed. I suspect the existing behavior should remain the default, with a knob 
to make it do things correctly, with perhaps a deprecation plan to turn on 
the correctly knob by default in a year or so.

Best,

David

Re: Add Unicode Support to the DBI

2011-09-21 Thread David E. Wheeler
On Sep 10, 2011, at 7:44 AM, Lyle wrote:

 Right now 5.8 is the required minimum for DBI: should we consider bumping 
 this?
 
 I know a lot of servers in the wild are still running RHEL5 and it's 
 variants, which are stuck on 5.8 in the standard package management. The new 
 RHEL6 only has 5.10...
 So at this time the impact of such change could be significant.

Yes, which is why we can't just impose a solution on people.

Best,

David

Re: Add Unicode Support to the DBI

2011-09-21 Thread David E. Wheeler
On Sep 10, 2011, at 3:08 AM, Martin J. Evans wrote:

 I'm not sure any change is required to DBI to support unicode. As far as I'm 
 aware unicode already works with DBI if the DBDs do the right thing.

Right, but the problem is that, IME, none of them do the right thing. As I 
said, I've submitted encoding-related bug reports for every DBD I've used in 
production code. And they all have different interfaces for tweaking things.

 If you stick to the rule that all data Perl receives must be decoded and all 
 data Perl exports must be encoded it works (ignoring any issues in Perl 
 itself).

Er, was there supposed to be a ", then …" statement there?

 I bow to Tom's experience but I'm still not sure how that applies to DBI - so 
 long as the interface between the database and Perl always encodes and 
 decodes then the issues Tom describes are all Perl ones - no?

The trouble is that:

1. They don't always encode or decode
2. When they do, they tend to get subtle bits wrong
3. And they all have different interfaces and philosophies for doing so

 Surely Oracle should return the data encoded as you asked for it and if it 
 did not Oracle is broken.
 I'd still like to see this case and then we can see if Oracle is broken and 
 if there is a fix for it.

Oh I don't doubt that Oracle is broken.

 In some places DBD::Oracle does sv_utf8_decode(scalar) or SvUTF8_on(scalar) 
 (depending on your Perl) and in some places it just does SvUTF8_on(scalar). I 
 believe the latter is much quicker as the data is not checked. Many people 
 (myself included) are particularly interested in DBD::Oracle being fast and 
 if all the occurrences were changed to decode I'd patch that out in my copy 
 as I know the data I receive is UTF-8 encoded.

IME it needs an "assume Oracle is broken" knob. That is, I should have the 
option to enforce encoding and decoding, rather than just flipping SvUTF8. And I 
think that such an interface should be standardized in the DBI along with 
detailed information for driver authors on how to get it right.
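The difference between the two approaches can be sketched in pure Perl, using
Encode's internal _utf8_on() as a stand-in for the XS-level SvUTF8_on()
(illustrative only, not a DBI API):

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# 0xFF can never appear in well-formed UTF-8.
my $bad = "abc\xFF";

# "Just flip the flag" (what SvUTF8_on amounts to at the XS level):
# fast, but the octets are never checked.
my $flagged = $bad;
Encode::_utf8_on($flagged);    # string is now *claimed* to be UTF-8

# "Enforce decoding": malformed input is caught at the boundary.
my $ok = eval { decode('UTF-8', $bad, FB_CROAK); 1 };
print $ok ? "decoded cleanly\n" : "rejected invalid UTF-8\n";
```

With FB_CROAK the decode dies on the first malformed byte, which is exactly
the "pedantic" behaviour being asked for here.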

 See above. I'd like the chance to go with speed and take the consequences 
 rather than go slower but know that incorrect UTF-8 is spotted.

And maybe that's the default. But I should be able to tell it to be pedantic 
when the data is known to be bad (see, for example, data from an 
SQL_ASCII-encoded PostgreSQL database).

 I thought UTF-8 when used in Perl used the strict definition and utf8 used 
 Perl's looser definition - see 
 http://search.cpan.org/~dankogai/Encode-2.44/Encode.pm#UTF-8_vs._utf8_vs._UTF8

That's right. So if I want to ensure that I'm getting strict encoding in my 
database, it needs to encode and decode, not simply flip SvUTF8.
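The strict/lax distinction can be demonstrated with Encode directly;
chr(0x110000) is used here because Perl's lax "utf8" can represent it while
strict "UTF-8" must reject it, being beyond the last Unicode code point
(U+10FFFF):

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK);

# Two copies, since encode() with a CHECK value modifies its source in place.
my $c1 = chr 0x110000;
my $c2 = chr 0x110000;

my $strict_ok = eval { encode('UTF-8', $c1, FB_CROAK); 1 };
my $lax_ok    = eval { encode('utf8',  $c2, FB_CROAK); 1 };

print $strict_ok ? "strict accepted\n" : "strict rejected\n";   # rejected
print $lax_ok    ? "lax accepted\n"    : "lax rejected\n";      # accepted
```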

 Don't DBDs do this now? I know the encoding of the data I receive in 
 DBD::ODBC and decode it when I get it and encode it when I send it and I 
 believe that is what DBD::Oracle does as well. There is one exception in ODBC 
 for drivers which don't truly abide by the ODBC spec and send 8-bit data back 
 UTF-8 encoded (see later).

There is no single API for configuring this in the DBI, and I argue there 
should be.

 I've spent a lot of effort getting unicode working in DBD::ODBC (for UNIX and 
 with patches from Alexander Foken for Windows) which is implemented in an 
 awkward fashion in ODBC. I'd like to hear from DBD authors what support they 
 already have and how it is implemented so we can see what ground is already 
 covered and where the problems were.

DBD::Pg's approach is currently broken. Greg is working on fixing it, but for 
compatibility reasons the fix is non-trivial (and the API might be, too). In a 
perfect world DBD::Pg would just always do the right thing, as the database 
tells it what encodings to use when you connect (and *all* data is encoded as 
such, not just certain data types). But the world is not perfect, there's a lot 
of legacy stuff.

Greg, care to add any other details?

 as I remain unconvinced a problem exists other than incorrectly coded DBDs. 
 I'm happy to collate that information. As a start I'll describe the DBD::ODBC:
 
 1. ODBC has 2 sets of APIs, SQLxxxA (each chr is 8 bits) and SQLxxxW (each 
 chr is 16 bits and UCS-2). This is how Microsoft did it and yes I know that 
 does not support all of unicode but code pages get involved too.
 
 2. You select which API you are using with a macro when you compile your 
 application so you cannot change your mind.
 You can in theory call SQLxxxA or SQLxxxW functions directly but if you use 
 SQLxxx you get the A or W depending on what the macro is set to.
 Problem: DBD::ODBC has to be built one way or the other.
 
 3. When using the SQLxxxA functions you can still bind columns/parameters as 
 wide characters but the ODBC driver needs to support this.
 
 4. When using SQLxxxW functions all strings are expected in UCS-2. You can 
 bind columns and parameters as whatever type you like but obviously if you 
 bind a unicode column as SQLCHAR instead of SQLWCHAR you probably get the 

Re: Add Unicode Support to the DBI

2011-09-21 Thread Greg Sabino Mullane

-BEGIN PGP SIGNED MESSAGE-
Hash: RIPEMD160


...
 And maybe that's the default. But I should be able to tell it to be pedantic 
 when the 
 data is known to be bad (see, for example data from an SQL_ASCII-encoded 
 PostgreSQL database).
...
 DBD::Pg's approach is currently broken. Greg is working on fixing it, but for 
 compatibility 
 reasons the fix is non-trivial (and the API might be, too). In a perfect world 
 DBD::Pg would 
 just always do the right thing, as the database tells it what encodings to 
 use when you 
 connect (and *all* data is encoded as such, not just certain data types). But 
 the world is 
 not perfect, there's a lot of legacy stuff.

 Greg, care to add any other details?

My thinking on this has changed a bit. See the DBD::Pg in git head for a 
sample, but basically, 
DBD::Pg is going to:

* Flip the flag on if the client_encoding is UTF-8 (and server_encoding is not 
SQL_ASCII)
* Flip it off if not
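For illustration, the decision described above might reduce to something like
this (a hypothetical helper, not DBD::Pg's actual code; encoding names follow
PostgreSQL's spellings):

```perl
use strict;
use warnings;

# Sketch: turn the utf8 flag on only when the client encoding is UTF8
# and the server encoding is not SQL_ASCII (i.e. byte soup).
sub should_flag_utf8 {
    my ($client_encoding, $server_encoding) = @_;
    return $client_encoding eq 'UTF8' && $server_encoding ne 'SQL_ASCII';
}

print should_flag_utf8('UTF8',   'UTF8')      ? "on\n" : "off\n";   # on
print should_flag_utf8('UTF8',   'SQL_ASCII') ? "on\n" : "off\n";   # off
print should_flag_utf8('LATIN1', 'UTF8')      ? "on\n" : "off\n";   # off
```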

The single switch will be pg_unicode_flag, which will basically override the 
automatic 
choice above, just in case you really want your SQL_ASCII byte soup marked as 
utf8 for 
some reason, or (more likely), you want your data unmarked as utf8 despite 
being so.

This does rely on PostgreSQL doing the right thing when it comes to 
encoding/decoding/storing 
all the encodings, but I'm pretty sure it's doing well in that regard.

...

Since nobody has actually defined a specific interface yet, let me throw out a 
straw man. It may look familiar :)

===
* $h->{unicode_flag}

If this is set on, data returned from the database is assumed to be UTF-8, and 
the utf8 flag will be set. DBDs will decode the data as needed.

If this is set off, the utf8 flag will never be set, and no decoding will be 
done 
on data coming back from the database.

If this is not set (undefined), the underlying DBD is responsible for doing the 
correct thing. In other words, the behaviour is undefined.
===
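As a sketch only (unicode_flag is a proposed attribute, not an existing DBI
one), the three-state behaviour could reduce to something like this on a
driver's fetch path:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical: $flag is 1 (on), 0 (off), or undef (unset).
sub fetch_filter {
    my ($flag, $octets) = @_;
    if (defined $flag && $flag) {
        # on: data is assumed UTF-8; decode it, which sets the utf8 flag
        return decode('UTF-8', $octets);
    }
    elsif (defined $flag) {
        # off: hand back raw octets, never flagged
        return $octets;
    }
    # undef: behaviour is driver-specific; this sketch passes data through
    return $octets;
}

my $raw = "na\xC3\xAFve";                       # UTF-8 octets for "naïve"
print length(fetch_filter(1, $raw)), "\n";      # 5 characters
print length(fetch_filter(0, $raw)), "\n";      # 6 octets
```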

I don't think this will fit into DBD::Pg's current implementation perfectly, as 
we wouldn't want people to simply leave $h->{unicode_flag} on, as that would 
force SQL_ASCII text to have utf8 flipped on. Perhaps we simply never, ever 
allow that.

- -- 
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201109211651
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-BEGIN PGP SIGNATURE-

iEYEAREDAAYFAk56TngACgkQvJuQZxSWSsiIfwCeKMfsg2RYsCzDuwb8FnmZhhbu
8LgAn2TNLuKirq5IDAhlCNmQ3gxbnuq7
=k+Fi
-END PGP SIGNATURE-




Re: Add Unicode Support to the DBI

2011-09-21 Thread David E. Wheeler
On Sep 21, 2011, at 1:52 PM, Greg Sabino Mullane wrote:

 Since nobody has actually defined a specific interface yet, let me throw out a 
 straw man. It may look familiar :)
 
 ===
 * $h->{unicode_flag}
 
 If this is set on, data returned from the database is assumed to be UTF-8, 
 and 
 the utf8 flag will be set.

I assume you also mean to say that data sent *to* the database has the flag 
turned off, yes?

 DBDs will decode the data as needed.

I don't understand this sentence. If the flag is flipped, why will it decode?

 If this is set off, the utf8 flag will never be set, and no decoding will be 
 done 
 on data coming back from the database.

What if the data coming back from the database is Big5 and I want to decode it?

 If this is not set (undefined), the underlying DBD is responsible for doing 
 the 
 correct thing. In other words, the behaviour is undefined.
 ===
 
 I don't think this will fit into DBD::Pg's current implementation perfectly, as 
 we wouldn't want people to simply leave $h->{unicode_flag} on, as that would 
 force SQL_ASCII text to have utf8 flipped on. Perhaps we simply never, ever 
 allow that.

You mean never allow it to be flipped when the database encoding is SQL_ASCII?

Best,

David



Re: Add Unicode Support to the DBI

2011-09-10 Thread H.Merijn Brand
On Sat, 10 Sep 2011 03:06:49 -, Greg Sabino Mullane
g...@turnstep.com wrote:

 One thing I see bandied about a lot is that Perl 5.14 is highly preferred. 
 However, it's not clear exactly what the gains are and how bad 5.12 is 
 compared to 5.14, how bad 5.10 is, how bad 5.8 is, etc. Right now 5.8 is 
 the required minimum for DBI: should we consider bumping this? I know TC 
 would be horrified to see us attempting to talk about Unicode support 
 with a 5.8.1 requirement, but how much of that will affect database 
 drivers? I have no idea myself.

Unicode-6.0 and Unicode improvements in general are *THE* reason for me
(our company) to plan for a 5.10.1 to 5.14.2 update.

I use Unicode a lot, and we require 5.8.4 as an absolute minimum when
dealing with Unicode. 5.8.1 is not good enough.

 Another aspect to think about that came up during some offline DBD::Pg 
 talks was the need to support legacy scripts and legacy data. While the 
 *correct* thing is to blaze forward and use Do Things Correctly everywhere, 
 I think we at least need some prominent knobs so that we can maintain 
 backwards compatibility for existing scripts that expect a bunch of 
 Latin1, or need the data to come back in the current, undecoded, 
 un-utf8-flagged way.
 
 - -- 
 Greg Sabino Mullane g...@turnstep.com

-- 
H.Merijn Brand  http://tux.nl  Perl Monger  http://amsterdam.pm.org/
using 5.00307 through 5.14 and porting perl5.15.x on HP-UX 10.20, 11.00,
11.11, 11.23 and 11.31, OpenSuSE 10.1, 11.0 .. 11.4 and AIX 5.2 and 5.3.
http://mirrors.develooper.com/hpux/   http://www.test-smoke.org/
http://qa.perl.org  http://www.goldmark.org/jeff/stupid-disclaimers/


Re: Add Unicode Support to the DBI

2011-09-10 Thread Martin J. Evans

On 10/09/2011 03:52, David E. Wheeler wrote:

DBIers,

tl;dr: I think it's time to add proper Unicode support to the DBI. What do you 
think it should look like?
I'm not sure any change is required to DBI to support unicode. As far as 
I'm aware unicode already works with DBI if the DBDs do the right thing.


If you stick to the rule that all data Perl receives must be decoded and 
all data Perl exports must be encoded it works (ignoring any issues in 
Perl itself).
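That rule can be sketched with the core Encode module (the variable names are
illustrative; a real driver does this at the XS level):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Decode octets coming in from the database, encode characters going out.

# Pretend these octets arrived from the database as UTF-8.
my $octets_from_db = "caf\xC3\xA9";        # UTF-8 bytes for "café"

# Decode on the way in: Perl now holds a character string.
my $chars = decode('UTF-8', $octets_from_db);
printf "length in characters: %d\n", length $chars;      # 4

# Encode on the way out: back to the octets the database expects.
my $octets_to_db = encode('UTF-8', $chars);
printf "length in octets: %d\n", length $octets_to_db;   # 5
```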



Background

I've brought this up a time or two in the past, but a number of things have 
happened lately to make me think that it was again time:

First, on the DBD::Pg list, we've been having a discussion about improving the 
DBD::Pg encoding interface.

   http://www.nntp.perl.org/group/perl.dbd.pg/2011/07/msg603.html

That design discussion followed on the extended discussion in this bug report:

   https://rt.cpan.org/Ticket/Display.html?id=40199

Seems that the pg_enable_utf8 flag that's been in DBD::Pg for a long time is 
rather broken in a few ways. Notably, PostgreSQL sends *all* data back to 
clients in a single encoding -- even binary data (which is usually 
hex-encoded). So it made no sense to only decode certain columns. How to go 
about fixing it, though, and adding a useful interface, has proven a bit tricky.

Then there was Tom Christiansen's StackOverflow comment:

   
stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129
I bow to Tom's experience but I'm still not sure how that applies to DBI - so 
long as the interface between the database and Perl always encodes 
and decodes then the issues Tom describes are all Perl ones - no?

This made me realize that Unicode handling is much trickier than I ever 
realized. But it also emphasized for me how important it is to do everything one 
can to do Unicode right. Tom followed up with a *lot* more detail in three 
OSCON presentations this year, all of which you can read here:

   http://98.245.80.27/tcpc/OSCON2011/index.html

(You're likely gonna want to install the fonts linked at the bottom of that 
page before you read the presentations in HTML).

And finally, I ran into an issue recently with Oracle, where we have an Oracle 
database that should have only UTF-8 data but some row values are actually in 
other encodings. This was a problem because I told DBD::Oracle that the 
encoding was Unicode, and it just blindly turned on the Perl utf8 flag. So I 
got broken data back from the database and then my app crashed when I tried to 
act on a string with the utf8 flag on but containing non-unicode bytes. I 
reported this issue in a DBD::Oracle bug report:

   https://rt.cpan.org/Public/Bug/Display.html?id=70819
Surely Oracle should return the data encoded as you asked for it and if 
it did not Oracle is broken.
I'd still like to see this case and then we can see if Oracle is broken 
and if there is a fix for it.


In some places DBD::Oracle does sv_utf8_decode(scalar) or 
SvUTF8_on(scalar) (depending on your Perl) and in some places it just 
does SvUTF8_on(scalar). I believe the latter is much quicker as the data 
is not checked. Many people (myself included) are particularly 
interested in DBD::Oracle being fast and if all the occurrences were 
changed to decode I'd patch that out in my copy as I know the data I 
receive is UTF-8 encoded.



But all this together leads me to believe that it's time to examine adding 
explicit Unicode support to the DBI. But it needs to be designed as carefully 
as possible to account for a few key points:

* The API must be as straightforward as possible without sacrificing necessary 
flexibility. I think it should mostly stay out of users' way and have 
reasonable defaults. But it should be clear what each knob we offer does and 
how it affects things. Side-effects should be avoided.

* Ability to enforce the correctness of encoding and decoding must be given 
priority. Perl has pretty specific ideas about what is and is not Unicode, so we 
should respect that as much as possible. If that means encoding and decoding 
rather than just flipping the utf8 bit, then fine.
See above. I'd like the chance to go with speed and take the 
consequences rather than go slower but know that incorrect UTF-8 is spotted.



* The performance impact must be kept as minimal as possible. So if we can get away with 
just flipping the UTF-8 bit on and off, it should be so. I'm not entirely clear on that, 
though, since Perl's internal representation, called utf8, is not the same 
thing as UTF-8. But if there's an efficient way to convert between the two, then it 
should be adopted. For other encodings, obviously a full encode/decode path must be 
followed.
I thought UTF-8 when used in Perl used the strict definition and utf8 
used Perl's looser definition - see 
http://search.cpan.org/~dankogai/Encode-2.44/Encode.pm#UTF-8_vs._utf8_vs._UTF8
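The speed trade-off Martin raises can be measured with the core Benchmark
module; Encode's internal _utf8_on() is used here purely as a stand-in for
the XS-level SvUTF8_on():

```perl
use strict;
use warnings;
use Encode qw(decode);
use Benchmark qw(cmpthese);

# For well-formed input both approaches yield the same character string;
# decode() validates every byte, _utf8_on() only flips the flag.
my $octets = "\xE2\x98\xBA" x 1000;    # 1000 copies of U+263A in UTF-8

cmpthese(5000, {
    decode  => sub { my $s = decode('UTF-8', $octets) },
    flag_on => sub { my $s = $octets; Encode::_utf8_on($s) },
});
```

On typical builds the flag-flip wins by a wide margin, which is exactly why
DBD authors reach for SvUTF8_on when they trust the source.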

* Drivers must be able to adopt the API in a straight-forward way. That is to 
say, we need to 

Re: Add Unicode Support to the DBI

2011-09-10 Thread Lyle

On 10/09/2011 04:06, Greg Sabino Mullane wrote:

Right now 5.8 is the required minimum for DBI: should we consider bumping this?


I know a lot of servers in the wild are still running RHEL5 and its 
variants, which are stuck on 5.8 in the standard package management. The 
new RHEL6 only has 5.10...

So at this time the impact of such change could be significant.


Lyle


Re: Add Unicode Support to the DBI

2011-09-09 Thread Greg Sabino Mullane

-BEGIN PGP SIGNED MESSAGE-
Hash: RIPEMD160


One thing I see bandied about a lot is that Perl 5.14 is highly preferred. 
However, it's not clear exactly what the gains are and how bad 5.12 is 
compared to 5.14, how bad 5.10 is, how bad 5.8 is, etc. Right now 5.8 is 
the required minimum for DBI: should we consider bumping this? I know TC 
would be horrified to see us attempting to talk about Unicode support 
with a 5.8.1 requirement, but how much of that will affect database 
drivers? I have no idea myself.

Another aspect to think about that came up during some offline DBD::Pg 
talks was the need to support legacy scripts and legacy data. While the 
*correct* thing is to blaze forward and use Do Things Correctly everywhere, 
I think we at least need some prominent knobs so that we can maintain 
backwards compatibility for existing scripts that expect a bunch of 
Latin1, or need the data to come back in the current, undecoded, 
un-utf8-flagged way.

- -- 
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201109092305
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-BEGIN PGP SIGNATURE-

iEYEAREDAAYFAk5q1DkACgkQvJuQZxSWSsj+nQCg9TvEVVrkz8GVvfUlanXJc0X7
vhkAoJEoJFSHXUkcoKI28a7aZH5HUd/s
=Wpyl
-END PGP SIGNATURE-




Re: Add Unicode Support to the DBI

2011-09-09 Thread Darren Duncan
Another wrinkle to this is the fact that identifiers in the database, such as 
column names and such, are also character data, and have an encoding.  So for 
any DBMSs that support Unicode identifiers (as I believe a complete one should, 
even if they have to be quoted in SQL) or identifiers with trans-ASCII 
characters, we have to account for those too, making sure that the various 
Perl-side code correctly matches or doesn't match those identifiers, and so on. 
-- Darren Duncan