Re: Add Unicode Support to the DBI

Greg Sabino Mullane Sun, 02 Oct 2011 20:49:44 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160


From: "David E. Wheeler" <da...@kineticode.com>
GSM>> * $h->{unicode_flag}
GSM>> If this is set on, data returned from the database is assumed to be 
GSM>> UTF-8, and the utf8 flag will be set.

DEW> I assume you also mean to say that data sent *to* the database 
DEW> has the flag turned off, yes?

No: that is undefined. I don't see it as the DBD's job to massage data 
going into the database. Or at least, I cannot imagine a DBI interface 
for that.

From: "Martin J. Evans" <martin.ev...@easysoft.com>
MJE> There is more than one way to encode unicode - not everyone uses 
MJE> UTF-8; although some encodings don't support all of unicode.

Right, but I'm talking UTF-8 here. There are only two things that 
can be done with the strings returned from a database: flip the 
utf8 flag, or convert/decode them to something else. If the data is 
anything but UTF-8, the utf8 flag is useless at best, harmful at worst.

DEW> Yeah, maybe should be utf8_flag instead.

Yes, very bad example. Let's call it utf8. Forget 'unicode' entirely.

MJE> 4) don't decode the data, the DBD knows it is say UTF-8 encoded 
MJE> and simply sets the UTF-8 flag (which from what I read is horribly 
MJE> flawed but seems to work for me).

Yeah, that last one is the current Postgres plan. Which I think should 
be best practice and a default DBI expectation.

GSM>> DBDs will decode the data as needed.

DEW> I don't understand this sentence. If the flag is 
DEW> flipped, why will it decode?

Because it may still need to convert things. See the ODBC discussion.

GSM>> If this is set off, the utf8 flag will never be set, and no 
GSM>> decoding will be done on data coming back from the database.

DEW> What if the data coming back from the database 
DEW> is Big5 and I want to decode it?

Eh? You just asked above why we would ever decode it.

DEW> You mean never allow it to be flipped when the 
DEW> database encoding is SQL_ASCII?

Yes, basically. But perhaps it does not matter too much. SQL_ASCII 
is such a bad idea anyway, I feel no need to coddle people using it. :)

MJE> So is the problem that sometimes a DBD does not know how to encode data 
MJE> being sent to the database or how/whether to decode data coming back from 
MJE> the database? And if that is the case, do we need some settings in DBI 
MJE> to tell a DBD?

I think that's one of the things that is being argued for, here.

MJE> I think this was my point above, i.e., why utf8? databases accept and 
MJE> supply a number of encodings so why have a flag called utf8? are we 
MJE> going to have ucs2, utf16, utf32 flags as well. Surely, it makes more 
MJE> sense to have a flag where you can set the encoding in the same form 
MJE> Encode uses.

Well, because UTF-8 is pretty much the de facto encoding, or at least way, way 
more popular than things like UCS-2. Also, the Perl utf8 flag encourages 
us to put everything into UTF-8.

MJE> and what about when the DBD knows you are wrong because the database 
MJE> says it is returning data in encoding X but you ask for Y.

I would assume that the DBD should attempt to convert it to Y if that 
is what the user wants.

MJE> DBD::Oracle to my knowledge has no special flags; it just attempts to do 
MJE> the right thing but it favours speed so most data that is supposed to be 
MJE> UTF-8 encoded has SvUTF8_on set but in one case (error messages) it 
MJE> properly and strictly decodes the message so long as your Perl is recent 
MJE> enough else it uses SvUTF8_on.

I'm not sure I understand this. It takes UTF-8 errors from the database, 
changes them to something else, and does NOT set SvUTF8?

MJE> (examples of DBD flags)

Almost all the examples from DBDs seem to be focusing on the SvUTF8 flag, so 
perhaps we should start by focusing on that, or at least decoupling that 
entirely from decoding? If we assume that the default DBI behavior, or more 
specifically the default behavior for a random DBD someone picks up is 
"flip the flag on if the data is known to be UTF-8", then we can propose a 
DBI attribute, call it utf8_flag, that has three states:

* 'A': (default). The DBD should do the best thing, which in most 
cases means setting SvUTF8_on if the data coming back is UTF-8.
* 'B': (on). The DBD should make every effort to set SvUTF8_on for returned 
data, even if it thinks it may not be UTF-8.
* 'C': (off). The DBD should not call SvUTF8_on, regardless of what it 
thinks the data is.
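
To make the three states concrete, here is a toy model. It is Python 
rather than Perl or DBI code, and every name in it (Scalar, 
apply_utf8_flag, db_is_utf8) is made up for illustration. It only shows 
the decision the DBD would make about the flag; the octets themselves 
are never touched.

```python
from dataclasses import dataclass


@dataclass
class Scalar:
    """Toy stand-in for a Perl scalar: raw octets plus the SvUTF8 flag."""
    octets: bytes
    utf8: bool = False


def apply_utf8_flag(octets: bytes, mode: str, db_is_utf8: bool) -> Scalar:
    """Decide the flag for returned data under the proposed utf8_flag states.

    'A' (default): set the flag only if the DBD knows the data is UTF-8.
    'B' (on):      set the flag even if the DBD has doubts.
    'C' (off):     never set the flag.
    """
    if mode not in ("A", "B", "C"):
        raise ValueError(f"unknown utf8_flag state: {mode!r}")
    if mode == "B":
        flagged = True
    elif mode == "C":
        flagged = False
    else:
        flagged = db_is_utf8
    # The octets are passed through unchanged; only the flag differs.
    return Scalar(octets, flagged)
```

Note that 'B' will happily flag octets that are not valid UTF-8, which 
is exactly the "harmful at worst" case, so it would be an explicit 
opt-in by the user.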

I presume the other half would be an encoding, such that
$h->{encoding} would basically ask the DBD to make any returned 
data into that encoding, by hook or by crook.
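
A rough sketch of what that conversion amounts to, again as a toy in 
Python and not as any real DBI or DBD interface (convert_for_client and 
its parameter names are hypothetical): if the database says it returns 
encoding X and the user asks for Y, decode from X and re-encode as Y, 
and do nothing when they already match.

```python
import codecs


def convert_for_client(octets: bytes, db_encoding: str, wanted_encoding: str) -> bytes:
    """Re-encode octets from the database's encoding into the requested one."""
    # Normalize names so that "UTF-8", "utf8" and "utf-8" compare equal.
    src = codecs.lookup(db_encoding).name
    dst = codecs.lookup(wanted_encoding).name
    if src == dst:
        return octets  # already in the requested encoding
    # Strict decoding raises UnicodeDecodeError if the database lied about X.
    return octets.decode(src).encode(dst)
```

For example, convert_for_client(big5_octets, "big5", "utf-8") would turn 
Big5 data into UTF-8 octets, after which the flag question is a separate 
decision.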

- -- 
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201110022345
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAk6JMKgACgkQvJuQZxSWSsgMsQCfdsB6cBwxEmcjvm1WLi9Khncc
I10AoM+M+UGjHjXrtpcQ2PcQOdmmU/n0
=BuvK
-----END PGP SIGNATURE-----
