Re: Add Unicode Support to the DBI

David E. Wheeler Thu, 22 Sep 2011 11:28:57 -0700

On Sep 22, 2011, at 11:14 AM, Martin J. Evans wrote:

>> Right. There needs to be a way to tell the DBI what encoding the server 
>> sends and expects to be sent. If it's not UTF-8, then the utf8_flag option 
>> is kind of useless.
> I think this was my point above, i.e., why utf8? Databases accept and supply 
> a number of encodings, so why have a flag called utf8? Are we going to have 
> ucs2, utf16, utf32 flags as well? Surely it makes more sense to have a flag 
> where you can set the encoding in the same form Encode uses.


Yes, I agreed with you. :-)

> Unless I'm mistaken as to what you refer to I believe that is a feature of 
> the Oracle client libraries and not one of DBD::Oracle so there is little we 
> can do about that.

Sure you can. I set something via the DBI interface and the DBD sets the 
environment variable for the Oracle client libraries.
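A rough sketch of that idea, assuming the variable in question is NLS_LANG and that the driver translates an Encode-style name into an Oracle character set name. The helper nls_lang_for and its mapping table are hypothetical, not part of DBD::Oracle, and the AMERICAN_AMERICA language/territory prefix is just a placeholder:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical mapping from canonical Encode names to Oracle charset names.
my %ORACLE_CHARSET = (
    'utf-8-strict' => 'AL32UTF8',
    'iso-8859-1'   => 'WE8ISO8859P1',
);

# Build an NLS_LANG value for an Encode-style encoding name.
# Returns undef when the encoding is unknown or has no mapping.
sub nls_lang_for {
    my ($encoding) = @_;
    my $obj = Encode::find_encoding($encoding) or return;
    my $charset = $ORACLE_CHARSET{ $obj->name } or return;
    return "AMERICAN_AMERICA.$charset";
}

# A driver would have to set this before the client library is initialised.
my $nls_lang = nls_lang_for('UTF-8');
$ENV{NLS_LANG} = $nls_lang if defined $nls_lang;
```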

> So to try and move forward, we'd be talking about a flag or flags which say:
> 
> 1 encode the data sent to the database like "this" (which could be nothing)
> 2 decode the data retrieved from the database like "this" (which could be 
> nothing but if not nothing it could be using strict or loose for the UTF-8 
> and utf-8 case)
> 3 don't decode but use SvUTF8_on (a specific case since Perl uses that 
> internally and a number of databases return UTF-8); one that seems to work 
> but I worry about.
> 4 do what the DBD thinks is best - whatever the behaviour is now?

Yes.
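To make the difference between options 2 and 3 concrete, here is a small sketch using the core Encode module. The helper names decode_strict and flag_only are made up for illustration; Encode::_utf8_on is the Perl-level counterpart of SvUTF8_on. It shows why the flag-only route is the one to worry about: invalid input is accepted silently.

```perl
use strict;
use warnings;
use Encode ();

my $valid   = "caf\xC3\xA9";   # "cafe" with e-acute, as UTF-8 bytes
my $invalid = "caf\xE9";       # the same word as Latin-1 bytes; not valid UTF-8

# Option 2: strict decoding validates the bytes and dies on bad input.
sub decode_strict {
    my ($bytes) = @_;
    return Encode::decode('UTF-8', $bytes,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# Option 3: only flip the internal UTF8 flag on a copy; nothing is validated.
sub flag_only {
    my ($bytes) = @_;
    my $copy = $bytes;
    Encode::_utf8_on($copy);
    return $copy;
}

my $good        = decode_strict($valid);                        # four characters
my $strict_ok   = eval { decode_strict($invalid); 1 } ? 1 : 0;  # strict decode refuses
my $flagged     = flag_only($invalid);                          # accepted without complaint
my $well_formed = Encode::is_utf8($flagged, 1) ? 1 : 0;         # but not well-formed
```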

> and what about when it conflicts with your locale/LANG?

So what?

> and what about PERL_UNICODE flags, do they come into this?

What are those?

> and what about when the DBD knows you are wrong because the database says it 
> is returning data in encoding X but you ask for Y.

Throw an exception or a warning.
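Something along these lines, as a sketch only: check_encoding is a hypothetical helper, and it assumes the DBD can obtain the name of the encoding the database reports. Encode::find_encoding is used to canonicalise aliases, so that "latin1" and "ISO-8859-1" compare equal.

```perl
use strict;
use warnings;
use Carp ();
use Encode ();

# Hypothetical helper: compare the encoding the database reports with the
# one the user asked for, and croak when they differ or are unknown.
sub check_encoding {
    my ($reported, $requested) = @_;
    my $have = Encode::find_encoding($reported)
        or Carp::croak("Unknown database encoding '$reported'");
    my $want = Encode::find_encoding($requested)
        or Carp::croak("Unknown requested encoding '$requested'");
    Carp::croak(sprintf "Database returns %s but %s was requested",
        $have->name, $want->name)
        if $have->name ne $want->name;
    return $want->name;
}
```

A driver that prefers a warning could call Carp::carp in place of the last croak and carry on with the reported encoding.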

> and for DBD::ODBC built for unicode API am I expected to try and decode UCS2 
> as x just because the flag tells me to and I know it will not work? Seems 
> like it only applies to the ANSI API in DBD::ODBC where the data could be 
> UTF-8 encoded in a few (possibly broken see 
> http://www.martin-evans.me.uk/node/20#unicode) cases.

If the user does something that makes no sense, tell them it makes no sense. 
Die if necessary.

> I still think it would help to name some specific cases per DBD of flags in 
> use and why they exist:
> 
> DBD::ODBC has an odbc_utf8_on flag to say that data returned by the database 
> when using the ANSI APIs is UTF-8 encoded and currently it calls SvUTF8_on on 
> that data (I've never used or verified it works myself but the person 
> supplying the patch said it had a purpose with a particular Postgres based 
> database).

That's what the new DBD::Pg flag that Greg's working on does, too.

> Beyond that DBD::ODBC has no other flags as it knows in the unicode/wide APIs 
> the data is UCS2 encoded and it checks it is valid when decoding it. 
> Similarly when sending data to the database in the wide APIs it takes the 
> Perl scalar and encodes it in UCS2.

Yeah, ideally, by default, if the DBD knows the encoding used by the database, 
it should just DTRT. There are backward compatibility issues with that for 
DBD::Pg, though. So there probably should be a knob to say "don't do any 
encoding or decoding at all", because a lot of older apps likely expect that.

> DBD::Oracle to my knowledge has no special flags; it just attempts to do the 
> right thing but it favours speed so most data that is supposed to be UTF-8 
> encoded has SvUTF8_on set but in one case (error messages) it properly and 
> strictly decodes the message so long as your Perl is recent enough else it 
> uses SvUTF8_on.
> 
> So, what are the other flags in use and what purpose do they fulfill?

I think we could really just start with one flag, "encoding". By default the 
DBD should just try to do the right thing. If "encoding" is set to ":raw" then 
it should do no encoding or decoding. If it's set to ":utf8" it should just 
turn Perl's UTF8 flag on or off. If it's set to an actual encoding it should 
encode and decode. I think that would be a good start.
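Roughly, and only as a sketch: the two helpers below are hypothetical and not an existing DBI API. They assume the flag is turned on for fetched data and off for data being sent, and they leave out the default "do the right thing" case, which depends on what each DBD knows about its database.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical: turn bytes fetched from the database into a Perl string.
sub decode_fetched {
    my ($encoding, $bytes) = @_;
    return $bytes if $encoding eq ':raw';        # no decoding at all
    if ($encoding eq ':utf8') {                  # just turn the flag on
        my $copy = $bytes;
        Encode::_utf8_on($copy);
        return $copy;
    }
    # An actual encoding name: decode properly, dying on invalid input.
    return Encode::decode($encoding, $bytes,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# Hypothetical: turn a Perl string into bytes to send to the database.
sub encode_bound {
    my ($encoding, $string) = @_;
    return $string if $encoding eq ':raw';       # no encoding at all
    if ($encoding eq ':utf8') {                  # just turn the flag off
        my $copy = $string;
        Encode::_utf8_off($copy);
        return $copy;
    }
    return Encode::encode($encoding, $string,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
}
```

With a named encoding, invalid data raises an exception, whereas ":utf8" trusts whatever bytes arrive.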

Best,

David
