Re: Add Unicode Support to the DBI

Martin J. Evans Thu, 22 Sep 2011 11:58:06 -0700

On 22/09/2011 19:28, David E. Wheeler wrote:

On Sep 22, 2011, at 11:14 AM, Martin J. Evans wrote:

Right. There needs to be a way to tell the DBI what encoding the server sends 
and expects to be sent. If it's not UTF-8, then the utf8_flag option is kind of 
useless.

I think this was my point above, i.e., why utf8? Databases accept and supply a 
number of encodings, so why have a flag called utf8? Are we going to have ucs2, 
utf16 and utf32 flags as well? Surely it makes more sense to have a flag where 
you can set the encoding in the same form Encode uses.

Yes, I agreed with you. :-)

progress then ;-)

Unless I'm mistaken as to what you refer to I believe that is a feature of the 
Oracle client libraries and not one of DBD::Oracle so there is little we can do 
about that.

Sure you can. I set something via the DBI interface and the DBD sets the 
environment variable for the Oracle client libraries.

OK, except that what the Oracle client libraries accept does not match the strings Encode accepts, so someone would have to come up with some sort of mapping between the two.


e.g., my current setting is:

NLS_LANG=AMERICAN_AMERICA.AL32UTF8
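
A minimal sketch of the kind of mapping mentioned above, assuming only a handful of character sets. The table %ORACLE_TO_ENCODE and the function encode_name_for_nls_lang are hypothetical names used for illustration; neither exists in DBD::Oracle or the DBI, and a real table would need many more entries.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical, deliberately incomplete table: Oracle character set
# name => name understood by Encode.
my %ORACLE_TO_ENCODE = (
    AL32UTF8     => 'UTF-8',
    WE8ISO8859P1 => 'iso-8859-1',
    WE8MSWIN1252 => 'cp1252',
    US7ASCII     => 'ascii',
);

# Take an NLS_LANG value of the form LANGUAGE_TERRITORY.CHARSET and
# return the matching Encode name, or undef if there is no character
# set part or it is not in the table.
sub encode_name_for_nls_lang {
    my ($nls_lang) = @_;

    # The character set is whatever follows the final dot.
    my ($charset) = $nls_lang =~ /\.([^.]+)\z/
        or return undef;

    my $name = $ORACLE_TO_ENCODE{ uc $charset }
        or return undef;

    # Only hand back names that this Perl's Encode can actually resolve.
    return Encode::find_encoding($name) ? $name : undef;
}
```

Going the other way (an Encode name set through the DBI, turned into an NLS_LANG value by the DBD) would need the reverse table, and some Encode names would probably have no Oracle equivalent at all.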

So to try and move forward, we'd be talking about a flag or flags which say:

1 encode the data sent to the database like "this" (which could be nothing)
2 decode the data retrieved from the database like "this" (which could be 
nothing but if not nothing it could be using strict or loose for the UTF-8 and utf-8 case)
3 don't decode but use SvUTF8_on (a specific case since Perl uses that 
internally and a number of databases return UTF-8)
  one that seems to work but I worry about.
4 do what the DBD thinks is best - whatever the behaviour is now?

Yes.

and what about when it conflicts with your locale/LANG?

So what?

I'm not so sure this is a "So what" as Perl itself uses locale settings in some cases - just thought it needed mentioning for consideration.

and what about PERL_UNICODE flags, do they come into this?

What are those?

See http://perldoc.perl.org/perlrun.html

In particular "UTF-8 is the default PerlIO layer for input streams", of which reading data from a database could be considered one?

and what about when the DBD knows you are wrong because the database says it is 
returning data in encoding X but you ask for Y.

Throw an exception or a warning.

ok

and for DBD::ODBC built for unicode API am I expected to try and decode UCS2 as 
x just because the flag tells me to and I know it will not work? Seems like it 
only applies to the ANSI API in DBD::ODBC where the data could be UTF-8 encoded 
in a few (possibly broken see http://www.martin-evans.me.uk/node/20#unicode) 
cases.

If the user does something that makes no sense, tell them it makes no sense. 
Die if necessary.

I still think it would help to name some specific cases per DBD of flags in use 
and why they exist:

DBD::ODBC has an odbc_utf8_on flag to say that data returned by the database 
when using the ANSI APIs is UTF-8 encoded and currently it calls SvUTF8_on on 
that data (I've never used or verified it works myself but the person supplying 
the patch said it had a purpose with a particular Postgres based database).

That's what the new DBD::Pg flag that Greg's working on does, too.

Beyond that DBD::ODBC has no other flags as it knows in the unicode/wide APIs 
the data is UCS2 encoded and it checks it is valid when decoding it. Similarly 
when sending data to the database in the wide APIs it takes the Perl scalar and 
encodes it in UCS2.

Yeah, ideally, by default, if the DBD knows the encoding used by the database, it should 
just DTRT. There are backward compatibility issues with that for DBD::Pg, though. So 
there probably should be a knob to say "don't do any encoding or decoding at 
all", because a lot of older apps likely expect that.

DBD::Oracle to my knowledge has no special flags; it just attempts to do the 
right thing but it favours speed so most data that is supposed to be UTF-8 
encoded has SvUTF8_on set but in one case (error messages) it properly and 
strictly decodes the message so long as your Perl is recent enough else it uses 
SvUTF8_on.

So, what are the other flags in use and what purpose do they fulfill?

I think we could really just start with one flag, "encoding". By default the DBD should just try to do the 
right thing. If "encoding" is set to ":raw" then it should do no encoding or decoding. If it's set 
to ":utf8" it should just turn the flag on or off. If it's set to an actual encoding it should encode and 
decode. I think that would be a good start.
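
A rough sketch of how the fetch side of such a single "encoding" setting might behave, written with Encode. The function name decode_fetched is hypothetical and nothing like it exists in the DBI; the "driver does the right thing" default is not modelled, so an undefined setting simply passes the value through.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: apply an "encoding" setting to one value
# fetched from the database. Not part of the DBI or any DBD.
sub decode_fetched {
    my ($bytes, $encoding) = @_;
    return $bytes unless defined $bytes;    # NULLs stay undef

    # undef stands in for the driver's own default, which this sketch
    # does not model; ":raw" means no decoding at all.
    return $bytes if !defined $encoding || $encoding eq ':raw';

    # Work on a copy so the caller's buffer is never modified.
    my $copy = $bytes;

    if ($encoding eq ':utf8') {
        # Flag only, no validation: the SvUTF8_on behaviour.
        Encode::_utf8_on($copy);
        return $copy;
    }

    # A real encoding name: decode strictly and die on malformed
    # input or on an encoding name Encode does not know.
    return Encode::decode($encoding, $copy, Encode::FB_CROAK);
}
```

With this shape, decode_fetched($value, 'UTF-8') dies on malformed input, while decode_fetched($value, ':utf8') accepts anything and leaves validation to the caller, which is the speed versus strictness trade-off discussed earlier in the thread.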

Best,

David


ok, I'm thinking through the ramifications of this.

To add to the list, I see DBD::SQLite has sqlite_unicode: "strings coming from the database and passed to the collation function will be properly tagged with the utf8 flag; but this only works if the sqlite_unicode attribute is set before the first call to a perl collation sequence" and "The current FTS3 implementation in SQLite is far from complete with respect to utf8 handling : in particular, variable-length characters are not treated correctly by the builtin functions offsets() and snippet()."


and DBD::CSV has

f_encoding => "utf8",

DBD::mysql has mysql_enable_utf8, which is documented as: "This attribute determines whether DBD::mysql should assume strings stored in the database are utf8. This feature defaults to off."


I could not find any special flags for DBD::DB2.

DBD::Sybase has syb_enable_utf8: "If this attribute is set then DBD::Sybase will convert UNIVARCHAR, UNICHAR, and UNITEXT data to Perl's internal utf-8 encoding when they are retrieved. Updating a unicode column will cause Sybase to convert any incoming data from utf-8 to its internal utf-16 encoding."


Martin
