Re: What to do with UTF-8 data?

Steve Hay Thu, 11 Sep 2003 04:31:22 -0700

Tim Bunce wrote:

On Thu, Sep 11, 2003 at 08:29:50AM +0200, Jochen Wiedmann wrote:

Hi, Steve,

The problem is: How do I trap all input/output to/from DBI to do these conversions?

I've asked about this on the dbi-users mailing list, and the answer (from Tim Bunce, no less) was that it is really the responsibility of the DBD driver to perform such conversions if the data in question is UTF-8.

That's not quite right. I wasn't talking about any _conversions_ at all.

I'm sorry if I mis-quoted you. I meant setting the UTF-8 flag on an octet sequence that can be interpreted as UTF-8, rather than leaving it unflagged and treated as Latin-1. Thus, the data is in some sense "converted" from Latin-1 to UTF-8.

After letting my thoughts settle, I have come to the conclusion that I do not agree completely. I think that DBI should do 80% of the job and leave about 20% to the driver authors.

For a "full solution" yes, I agree - and I've written about this in the past.
For now I'm just talking about the specific but fairly common
situation of fetching data that is UTF-8 encoded but doesn't
get flagged as such by the driver.
For that case the driver just needs to know when to do a SvUTF8_on(sv).

Exactly.

What about data going _into_ the database? In my examples of doing the conversion manually with Encode::{en|de}code_utf8(), I was converting the Perl strings to octet sequences that could later be interpreted as UTF-8 before insertion into the database. That way I could guarantee that all data retrieved from the database can be converted to UTF-8, in fact (as you pointed out) by simply turning the UTF-8 flag on.

If all the data that I insert really is UTF-8 then I guess it will just get "serialised" as a sequence of octets, and everything will be OK.

But what if the data I'm inserting isn't all UTF-8? The problem is:

1. Perl's internal format isn't just UTF-8 -- it defaults to Latin-1 (or whatever) for strings in which every character can be represented in Latin-1;
2. The "8-bit" characters of Latin-1 are represented as two-byte characters in UTF-8.

So, if I have the string "Copyright © Fred Bloggs" in Perl then it will not be UTF-8: the © is stored as one byte, not two, and the UTF-8 flag is off. If I insert that straight into the database without running it through Encode::encode_utf8() first, then the single-byte © itself, rather than its two-byte UTF-8 representation, gets stored in the database. When it is retrieved from the database later you can't just turn the UTF-8 flag on -- you would need to run it through Encode::decode_utf8().
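
A minimal sketch of the © case, using only the core Encode module and no database. The string is written with the escape \xA9 so that the result does not depend on the encoding of the source file; the variable names are illustrative only. It shows the flag being off, the one-octet difference that encode_utf8() makes, and that the encoded octets decode back to the original string. As far as I can tell, a strict UTF-8 decode rejects the unencoded form outright, which is why a stored lone © octet cannot later be treated as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode);

# U+00A9 (copyright sign) fits in Latin-1, so Perl stores it as one byte.
my $str = "Copyright \xA9 Fred Bloggs";

my $flag = utf8::is_utf8($str) ? "on" : "off";
print "UTF-8 flag: $flag\n";

# Encoding first turns the single A9 octet into the two octets C2 A9.
my $octets = encode_utf8($str);
printf "unencoded: %d octets, encoded: %d octets\n",
    length($str), length($octets);

# The encoded octets decode back to the original characters.
# decode() with a CHECK argument may modify its input, so work on a copy.
my $copy = $octets;
my $back = decode('UTF-8', $copy, Encode::FB_CROAK());
print "encoded form round-trips: ", ($back eq $str ? "yes" : "no"), "\n";

# The unencoded octets are not valid UTF-8, so a strict decode dies.
my $raw = $str;
my $ok  = eval { decode('UTF-8', $raw, Encode::FB_CROAK()); 1 };
print "unencoded form is valid UTF-8: ", ($ok ? "yes" : "no"), "\n";
```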

In other words, just having the driver switch the UTF-8 flag on and off will only work if I guarantee that all the strings I feed it really are UTF-8 to start with, even when Perl would not normally have represented them as such.

It would be cool if something akin to "binmode STDOUT, ':utf8';" could be applied when sending data to the driver -- i.e. my data is in "Perl's internal format", whether that be Latin-1 or UTF-8 in the case of the string at hand, and it all gets automagically upgraded to UTF-8 if necessary before insertion into the database. Then you only need to turn the flag on when retrieving it again.
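
A rough sketch of that idea, done by hand rather than by DBI or a driver. The helpers to_db() and from_db() and the %fake_db hash are hypothetical stand-ins (nothing here is a real DBI or DBD API); the hash plays the part of the database so the example runs on its own. Every string is encoded to UTF-8 octets on the way in, whichever internal representation Perl happened to choose, so everything coming back out can be treated uniformly as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical helpers, not part of DBI or any DBD driver.
sub to_db   { my ($string) = @_; return encode_utf8($string) }
sub from_db { my ($octets) = @_; return decode_utf8($octets) }

my %fake_db;    # stand-in for a real table, keyed by row number

my @inputs = (
    "Copyright \xA9 Fred Bloggs",    # fits in Latin-1: flag off
    "Price: \x{20AC}100",            # euro sign: flag on
);

# Store every string as UTF-8 octets, regardless of its internal form.
$fake_db{$_} = to_db($inputs[$_]) for 0 .. $#inputs;

# On the way out, every row can be decoded the same way.
for my $row (0 .. $#inputs) {
    my $back = from_db($fake_db{$row});
    print "row $row round-trips: ",
        ($back eq $inputs[$row] ? "yes" : "no"), "\n";
}
```

Because the stored octets are always UTF-8, a driver that merely turned the flag on when fetching would give the same result as from_db() here.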

At least, I think that's what I want :-s

- Steve
