Re: Add Unicode Support to the DBI

Martin J. Evans Thu, 22 Sep 2011 11:14:41 -0700

On 22/09/2011 17:36, David E. Wheeler wrote:

On Sep 22, 2011, at 2:26 AM, Martin J. Evans wrote:

There is more than one way to encode Unicode; not everyone uses UTF-8,
although some encodings don't support all of Unicode.

Yeah, maybe should be utf8_flag instead.

see below.

unicode is not encoded as UTF-8 in ODBC using the wide APIs.

Using the wide ODBC APIs returns data in UCS2 encoding and DBD::ODBC decodes 
it. Using the ANSI APIs data is returned as octets and is whatever it is - it 
may be ASCII, it may be UTF-8 encoded (only in 2 cases I know and I believe 
they are flawed anyway) it may be something else in which case the application 
needs to know what it is. In the case of octets which are UTF-8 encoded 
DBD::ODBC has no idea that is the case unless you tell it and it will then set 
the UTF-8 flag (but see later).
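
The wide/ANSI distinction can be sketched as follows. This is a minimal illustration in Python, not DBD::ODBC code: the function names are hypothetical, and it assumes the wide API hands back 16-bit little-endian code units.

```python
from typing import Optional


def from_wide_api(raw: bytes) -> str:
    """Wide API case: the encoding is known, so the driver can always decode."""
    # Assumption: 16-bit little-endian units; UTF-16LE covers everything UCS-2 can hold.
    return raw.decode("utf-16-le")


def from_ansi_api(raw: bytes, encoding: Optional[str] = None):
    """ANSI API case: octets come back as-is unless the caller names an encoding."""
    if encoding is None:
        return raw  # the driver cannot guess; the application has to know
    return raw.decode(encoding)


text = "caf\u00e9"
wide = text.encode("utf-16-le")  # what a wide API might hand back
ansi = text.encode("utf-8")      # what an ANSI API might hand back from a UTF-8 database

assert from_wide_api(wide) == text
assert from_ansi_api(ansi) == ansi           # octets left untouched
assert from_ansi_api(ansi, "utf-8") == text  # decoded only when told how
```

The point of the sketch is that only the second function needs outside information, which is the situation a flag like odbc_utf8_on addresses.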

Right. There needs to be a way to tell the DBI what encoding the server sends 
and expects to be sent. If it's not UTF-8, then the utf8_flag option is kind of 
useless.

I think this was my point above, i.e., why utf8? Databases accept and supply a
number of encodings, so why have a flag called utf8? Are we going to have ucs2,
utf16, utf32 flags as well? Surely, it makes more sense to have a flag where
you can set the encoding in the same form Encode uses.

I'm not that familiar with Postgres (I've used it a few times and not to any
great degree) and I used MySQL for a while years ago. I occasionally use
SQLite. I do use DBD::Oracle and DBD::ODBC all the time. I'm still struggling
to see the problem that needs fixing. Is it just that some people would like a
DBI flag which tells the DBD:

1) decode any data coming back from the database strictly such that if it is 
invalid you die
2) decode any data coming back from the database loosely (e.g., utf-8 vs UTF-8)
3) don't decode the data from the database at all
4) don't decode the data; the DBD knows it is, say, UTF-8 encoded and simply
sets the UTF-8 flag (which from what I read is horribly flawed but seems to
work for me).

and the reverse.
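
Modes 1 to 3 above can be illustrated with a short sketch. It is written in Python purely as an analogy, and the function names are hypothetical. Python's "replace" error handler stands in for lenient handling here; it is not claimed to match what Perl's Encode does in the loose case. Mode 4 (setting the flag without decoding) is specific to Perl's internals and has no counterpart in this sketch.

```python
def decode_strict(raw: bytes) -> str:
    # Mode 1: invalid input raises UnicodeDecodeError (the "die" case).
    return raw.decode("utf-8", errors="strict")


def decode_loose(raw: bytes) -> str:
    # Mode 2 (analogy only): invalid bytes become U+FFFD instead of raising.
    return raw.decode("utf-8", errors="replace")


def no_decode(raw: bytes) -> bytes:
    # Mode 3: hand the octets back untouched.
    return raw


good = "na\u00efve".encode("utf-8")
bad = b"abc\xff"  # the byte 0xFF never appears in valid UTF-8

assert decode_strict(good) == "na\u00efve"
assert decode_loose(bad) == "abc\ufffd"
assert no_decode(bad) == bad

try:
    decode_strict(bad)
except UnicodeDecodeError:
    strict_failed = True
else:
    strict_failed = False
assert strict_failed
```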

Yes, with one API for all drivers, if possible, and guidelines for how it 
should work (when to encode and decode, what to encode and decode, when to just 
flip the utf8 flag on and off, etc.).

ok

DBD::Oracle does 1 some of the time and it does 4 the rest of the time e.g. 
error messages are fully decoded from UTF-8 IF Oracle is sending UTF-8 and it 
does 4 on most of the column data IF Oracle is sending UTF-8.

Yeah, but to enable it *you set a bloody environment variable*. WHAT?

Unless I'm mistaken as to what you refer to I believe that is a feature of the
Oracle client libraries and not one of DBD::Oracle so there is little we can do
about that.

My point being, doesn't the DBD know how the data is encoded when it gets it
from the database? It would hopefully also know what the database needs when
sending data. Perhaps in some conditions the DBD does not know this and needs
to be told (I could imagine SQLite reading/writing straight to files, for
instance, might want to know to open the file with a UTF-8 layer).

Or to turn it off, so you can just pass the encoded UTF-8 through to the file 
without the decode/encode round-trip.

So is the problem that sometimes a DBD does not know how to encode data being
sent to the database or how/whether to decode data coming back from the
database? And if that is the case, do we need some settings in DBI to tell a
DBD?

That's an issue, yes, but the main issue is that all the drivers do it 
differently, sometimes with different semantics, and lack all the functionality 
one might want (e.g., your examples 1-4).

Best,

David


So to try and move forward, we'd be talking about a flag or flags which say:

1 encode the data sent to the database like "this" (which could be nothing)

2 decode the data retrieved from the database like "this" (which could be
  nothing, but if not nothing it could be using strict or loose for the UTF-8
  and utf-8 case)

3 don't decode but use SvUTF8_on (a specific case since Perl uses that
  internally and a number of databases return UTF-8);
  one that seems to work but I worry about.

4 do what the DBD thinks is best - whatever the behaviour is now?
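
One way to picture options 1 and 2 is a small settings object that carries an encoding name for each direction. The sketch below is in Python for illustration only; `CharsetPolicy` and its fields are hypothetical and do not exist in the DBI or any DBD. Options 3 and 4 are left out because they depend on Perl internals and on each driver's current behaviour.

```python
import codecs
from dataclasses import dataclass
from typing import Optional


@dataclass
class CharsetPolicy:
    """Hypothetical per-handle settings; none of these names exist in the DBI."""
    encode_as: Optional[str] = None  # None means "send the value as-is"
    decode_as: Optional[str] = None  # None means "return the octets as-is"
    strict: bool = True              # strict vs lenient decoding

    def __post_init__(self):
        # Reject unknown encoding names early rather than at first use.
        for name in (self.encode_as, self.decode_as):
            if name is not None:
                codecs.lookup(name)  # raises LookupError for an unknown name

    def to_database(self, value):
        # Already-encoded octets, or no encoding requested: pass through.
        if self.encode_as is None or isinstance(value, bytes):
            return value
        return value.encode(self.encode_as)

    def from_database(self, raw: bytes):
        if self.decode_as is None:
            return raw
        errors = "strict" if self.strict else "replace"
        return raw.decode(self.decode_as, errors=errors)


policy = CharsetPolicy(encode_as="utf-8", decode_as="utf-16-le")
assert policy.to_database("\u20ac") == b"\xe2\x82\xac"
assert policy.from_database(b"\xac\x20") == "\u20ac"
assert CharsetPolicy().from_database(b"\xff") == b"\xff"  # "nothing" means untouched
```

Using names rather than one boolean per encoding avoids needing separate ucs2, utf16 and utf32 flags.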

and what about when it conflicts with your locale/LANG?

and what about PERL_UNICODE flags, do they come into this?

and what about when the DBD knows you are wrong because the database says it is
returning data in encoding X but you ask for Y?

and for DBD::ODBC built for unicode API am I expected to try and decode UCS2 as
x just because the flag tells me to and I know it will not work? Seems like it
only applies to the ANSI API in DBD::ODBC where the data could be UTF-8 encoded
in a few (possibly broken see http://www.martin-evans.me.uk/node/20#unicode)
cases.

I still think it would help to name some specific cases per DBD of flags in use
and why they exist:

DBD::ODBC has an odbc_utf8_on flag to say that data returned by the database
when using the ANSI APIs is UTF-8 encoded, and currently it calls SvUTF8_on on
that data (I've never used it or verified it works myself, but the person
supplying the patch said it had a purpose with a particular Postgres-based
database). Beyond that DBD::ODBC has no other flags, as it knows that in the
unicode/wide APIs the data is UCS2 encoded and it checks it is valid when
decoding it. Similarly, when sending data to the database in the wide APIs it
takes the Perl scalar and encodes it in UCS2.

DBD::Oracle to my knowledge has no special flags; it just attempts to do the
right thing but it favours speed so most data that is supposed to be UTF-8
encoded has SvUTF8_on set but in one case (error messages) it properly and
strictly decodes the message so long as your Perl is recent enough else it uses
SvUTF8_on.


So, what are the other flags in use and what purpose do they fulfil?

Martin
