RE: Adding utf8 support to DBD::mysql

Martin J. Evans Sun, 30 Apr 2006 12:17:40 -0700

Shamelessy stolen from Michael Kröll and against a patch on dbd::mysql
on subversion I've sent to Patrick which is not committed yet:


@@ -2915,6 +2914,9 @@
           if (dbis->debug >= 2)
             PerlIO_printf(DBILOGFP, "st_fetch string data %s\n", fbh->data);
           sv_setpvn(sv, fbh->data, fbh->length);
+          if (is_high_bit_set(fbh->data)) {
+              SvUTF8_on(sv);
+          }
           break;
@@ -2968,6 +2970,10 @@
           {    --len; }
         }
         sv_setpvn(sv, col, len);
+        if (is_high_bit_set(col)) {
+            SvUTF8_on(sv);
+        }
+        

@@ -3881,3 +3901,11 @@
   return
sv_2mortal(my_ulonglong2str(mysql_insert_id(&((imp_dbh_t*)imp_dbh)->mysql)));
 }
 #endif
+
+int is_high_bit_set(val)
+                char *val;
+{
+    while (*val)
+        if (*val++ & 0x80) return 1;
+    return 0;
+}


would appear to make the previous code (below and without the Encode) I posted
work fine.

I hasten to add I've not tested this much (other than the posted test code).

What I'm personally interested in is whether retrieving utf data from
DBD::mysql and pushing it through JSON, JSON->objToJson works and I cannot
easily test this until Tuesday.

If Patrick can commit my previous changes I'll test this more.

Martin
--
Martin J. Evans
Easysoft Ltd, UK
http://www.easysoft.com


On 30-Apr-2006 Martin J. Evans wrote:
> I hope I'm not muddying the waters but dbd::mysql UTF support for returned
> data (not metadata) seems to be nearly there. I need UTF8 support on
> at least inserted and returned results-sets although I'm less bothered
> by UTF8 table/column names. It would seem that if you define your table as
> using UTF8 then insertion is not a problem but retrieval is. The
> following code nearly works - it is just the setting of the utf8 flag on
> the returned data that is wrong:
> 
>#!/usr/bin/perl -w
> use strict;
> use DBI qw(:utils);
> use charnames ':full';
> use Encode;
> 
> print "Is utf8::is_utf8 defined: ", defined &utf8::is_utf8, "\n";
> print "Is utf8::valid defined: ", defined &utf8::valid, "\n";
> my $str = "\x{263a}xxx" . chr(0x05d0) . "\N{ARABIC LETTER ALEF}"; # smiley
> print join(" ", unpack("H*", $str)), "\n";
> print "length(str) = ", length($str), "\n";
> print "bytes::length(str) = ", bytes::length($str), "\n";
> 
> print "utf8::is_utf8 = ", utf8::is_utf8($str) ? 1 : 0, "\n";
> print "data_string_desc: ", data_string_desc($str),"\n";
> 
> open OUT, ">uni.out";
> binmode(OUT, ":utf8");
> print OUT "$str\n";
># data written to uni.out is UTF8
> 
> my $dbh = DBI->connect("dbi:mysql:test", "xxx", "xxx");
># there are posts on dbi-user as to whether both or either of
># the following should be set
> $dbh->do("set character set utf8");
> $dbh->do("set names utf8");
> $dbh->do("drop table if exists utf");
> $dbh->do("create table utf (a char(100)) default charset utf8");
> my $sth = $dbh->prepare("insert into utf values (?)");
> $sth->execute($str);
> $sth = $dbh->prepare("select * from utf");
> $sth->execute;
> 
> my @row = $sth->fetchrow_array;
> print "data_string_desc (after fetch): ", data_string_desc($row[0]),"\n";
># the following shows we'e got the right data back
># but perl does not know it is utf8
> print join(" ", unpack("H*", $row[0])), "\n";
># turning on utf8 causes the rignt uf8 sequence to be output
># and hence sv_utf8_upgrade(sv) will probably work
> Encode::_utf8_on($row[0]);
> print "data_string_desc (after fetch): ", data_string_desc($row[0]),"\n";
> 
> open OUT, ">utf.out";
> binmode (OUT, ":utf8");
> print OUT $row[0];
> close OUT;
># data written to utf.out is not UTF8 unless is marked utf8
> 
> produces:
> 
> Is utf8::is_utf8 defined: 1
> Is utf8::valid defined: 1
> e298ba787878d790d8a7
> length(str) = 6
> bytes::length(str) = 10
> utf8::is_utf8 = 1
> data_string_desc: UTF8 on, non-ASCII, 6 characters 10 bytes
> data_string_desc (after fetch): UTF8 off, non-ASCII, 10 characters 10 bytes
> e298ba787878d790d8a7
> 
> with utf.out containing:
> 
> <C3><A2><C2><98><C2><BA>xxx<C3><97><C2><90><C3><98><C2><A7>
> 
> without that Encode::_utf8_on($row[0]);
> 
> Michael Kröll (apr-10-2006)
> posted a change for DBD::db2 which seemed to sort this out for DB2
> so a similar change could be added to mysql.
> 
> Hope this helps.
> 
> Martin
> --
> Martin J. Evans
> Easysoft Ltd, UK
> http://www.easysoft.com
> 
> 
> On 24-Apr-2006 Tim Bunce wrote:
>> [I'm at the mysql conference and Patrick asked me about adding utf8
>> support to DBD::mysql. I said I'd look at the libmysql docs and give my
>> thoughts. I'm posting to dbi-dev since it may be of interest to others
>> interested in enhancing DBD::mysql and to other driver developers.
>> These are just random thoughts from a quick look at the docs.]
>> 
>> The keys mysql docs seem to be
>> http://dev.mysql.com/doc/refman/4.1/en/charset-connection.html
>> 
>> The mysql api and client->server protocol doesn't support passing
>> characterset info to the server on a per-statement / per-bind value basis.
>> (http://dev.mysql.com/doc/refman/4.1/en/c-api-prepared-statement-datatypes.ht
>> m
>> l)
>> So the sane way to send utf8 to the server is by setting the 'connection
>> character set' to utf8 and then only sending utf8 (or its ASCII subset)
>> to the server on that connection.
>> 
>> *** Fetching data:
>> 
>> MySQL 4.1.0 added "unsigned int charsetnr" to the MYSQL_FIELD structure.
>> It's the "character set number for the field".
>> 
>> So set the UTF8 flag based on that value. Something like:
>>     (field->charsetnr = ???) ? SvUTF8_on(sv) : SvUTF8_off(sv);
>> I couldn't see any docs for the values of the charsetnr field.
>> 
>> Also, would be good to enable perl code to access the charsetnr values:
>>     $sth->{mysql_charsetnr}->[$i]
>> 
>> *** Fetching Metadata:
>> 
>> The above is a minimum. It doesn't address metadata like field names
>> ($sth->{NAME}) that might also be in utf8. For that the driver needs to
>> know if the 'connection character set' is currently utf8.
>> 
>> (The docs mention mysql->charset but it's not clear if that's part of
>> the public API.)
>> 
>> However it's detected, the code needs to end up doing:
>>     (...connection charset is utf8...) ? SvUTF8_on(sv) : SvUTF8_off(sv);
>> on the metadata.
>> 
>> 
>> *** SET NAMES '...'
>> 
>> Intercept SET NAMES and call the mysql_set_character_set() API instead.
>> See http://dev.mysql.com/doc/refman/4.1/en/mysql-set-character-set.html
>> 
>> 
>> *** Detecting Inconsistencies
>> 
>> If the connection character set is _not_ utf8 but the application calls
>> the driver with data (or SQL statement) that has the UTF8 flag set, then
>> it could issue a warning. In practice that may be to be too noisy for
>> people that done their own workarounds for utf8 support. If so then
>> they could be changes to level 1 trace messages.
>> 
>> If the connection character set _is_ utf8, and the application calls
>> the driver with data (or SQL statement) that does _not_ have the UTF8
>> flag set but _does_ have bytes with the high bit set, then the driver
>> should issue a warning. The checking for high bit set is an extra cost
>> so this should only be enabled if tracing and/or an attribute is set
>> (perhaps called $dbh->{mysql_charset_checks} = 1)
>> 
>> Tim.

RE: Adding utf8 support to DBD::mysql

Reply via email to