Re: DBI and character sets (yet again)

Dean Arnold Sun, 21 Mar 2004 16:51:51 -0800

> 
> > Is there a consistent charset encoding behavior defined for
> > DBI at this time ?
> 
> No.
> 
> > If not, is a rule wrt charset encoding behavior needed ? 
> 
> Yes.
> 
> > If a list of charset behaviors for each DBD is needed,
> > I'd be happy to put one together, assuming the DBD authors
> > send me the details for each driver.
> 
> That would be great.


OK. Shall we start w/ DBD::Oracle ? ;^)
And driver authors, feel free to forward to me (and/or this
list). I'll try to put together a little webpage with the info.

> 
> 1. Most applications only work with one character set encoding
>    (not counting UTF8). Obvious example: Latin-1.

Agreed.

> 
> 2. Unicode is where we're going. Get used to it.
> 


Agreed.

> 3. I don't really want the DBI to be involved in any recoding
>    of character sets (from client charset to server charset)
>    and I suggest that the drivers don't try to do that either.
> 


OK. It could certainly be a performance killer.

> 4. DBI v2 will provide hooks to allow callbacks to be fired
>    on fetching a field and/or row and that could be used by an
>    application for recoding if it wants to 'hide' it under the DBI.
> 
> 5. When selecting data from the database the driver should:
>    - return strings which have a unicode character set as UTF8.
>    - return strings with other character sets as-is (unchanged) on
>      the presumption that the application knows what to do with it.

The problem is that there's no standard metadata currently defined
to report what the encoding is, and (AFAIK) Perl only has an
"is_utf8()" function that can be applied to each string independently.

E.g., let's say I'm writing a database IDE with the new Perl/Tk
that handles UNICODE. I want my tool to be dbms/driver independent
(as much as possible). I'm retrieving data to stuff into a fancy
spreadsheet, and I want ptk to use UNICODE so I'm I18N'd. If
the dbms/driver returns UTF8, that's all great. If it doesn't, or it
returns things with multiple encodings (ugh), I don't have
any (standard) metadata to tell me how to force the string into utf8
via Encode::from_to(). Or for that matter, if drivers aren't tagging
the returned strings as UTF8, I don't have any idea what I'm dealing with.

I'm not expecting every driver, or DBI, to normalize everything; rather, just
a piece of info to tell an app what encoding the data is in. Presumably,
just another bit of $sth metadata, e.g., $sth->{CHAR_SET}, to provide
the info. If the driver doesn't know, then it fills in with undef, and
the app is on its own. Otherwise, the app has enough info to make
the necessary conversion:

foreach (1 .. @{ $sth->{TYPE} }) {
    Encode::from_to($row->[$_-1], $sth->{CHAR_SET}->[$_-1], 'utf8')
        if (defined($sth->{CHAR_SET}->[$_-1]) && ($sth->{CHAR_SET}->[$_-1] ne 'utf8'));
}
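
A variation on the loop above, as a rough sketch only: Encode::from_to()
rewrites the bytes in place but leaves Perl treating the result as plain
bytes, so an app that wants real character strings (e.g. for Perl/Tk) may
prefer Encode::decode(). The $sth here is just a hash standing in for a
statement handle, and CHAR_SET is still the proposed, non-standard attribute,
not something DBI provides.

```perl
use strict;
use warnings;
use Encode ();

# Stand-in for a statement handle. CHAR_SET is hypothetical (proposed above).
my $sth = { CHAR_SET => [ 'iso-8859-1', undef, 'UTF-8' ] };

# One fetched row: Latin-1 bytes, unknown bytes, UTF-8 bytes.
my $row = [ "caf\xE9", "\x01\x02", "caf\xC3\xA9" ];

for my $i (0 .. $#$row) {
    my $charset = $sth->{CHAR_SET}[$i];

    # Unknown encoding (undef): leave the bytes alone, the app is on its own.
    next unless defined $charset && defined $row->[$i];

    # decode() turns encoded bytes into a Perl character string.
    $row->[$i] = Encode::decode($charset, $row->[$i]);
}
```

After the loop the first and third columns compare equal as character
strings, and the second is untouched.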

As for parameter data, there might be an optional CHAR_SET attribute provided,
or perhaps each driver can specify its "preferred" encoding, and the app
can coerce its data into that encoding as needed.

(I think the PerlIO encoding mechanisms provide some direction, or at least
have to deal with similar issues:
http://www.perldoc.com/perl5.8.0/pod/perluniintro.html#Unicode-I-O.
Come to think of it, it might be of interest to those using DBD::CSV
or other file-based DBDs ?)

Actually, this all may be a Perl problem more than a DBI issue. E.g.,
in Java, String objects always have an associated character set,
but all Perl appears to have is "it's UTF8" or "I don't know what it is except
a bunch of bytes".
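
To illustrate the two states (a small sketch, Perl 5.8+ with the Encode
module; the sample data is made up): the same five bytes are either "just
bytes" or, once decoded, a four-character string with the UTF8 flag on.
Nothing in the string itself records which encoding the bytes came from.

```perl
use strict;
use warnings;
use Encode ();

# UTF-8 encoded bytes, as a driver might hand them back untagged.
my $bytes = "caf\xC3\xA9";

# The same data decoded into a Perl character string.
my $chars = Encode::decode('UTF-8', $bytes);

printf "bytes: length=%d is_utf8=%d\n",
    length($bytes), Encode::is_utf8($bytes) ? 1 : 0;
printf "chars: length=%d is_utf8=%d\n",
    length($chars), Encode::is_utf8($chars) ? 1 : 0;
```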

> 
> 6. Drivers that want to can offer a mechanism to recode non-unicode
>    character sets into unicode but I don't see a big need for the
>    DBI to standardize an interface for that at the moment.
> 
> 7. DBI v2 will probably provide a way for applications to force the
>    UTF8 flag on particular columns as a workaround for drivers that
>    don't know the string of bytes they're returning is actually UTF8.
> 
> 8. When passing data to the database (including the SQL statement)
>    the driver should (perhaps) warn if it's presented with UTF8
>    strings but the driver or database can't handle unicode.

Qualified warning: attempt to convert to (e.g.) Latin1 first, and only raise
the error if that conversion fails:

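if (Encode::is_utf8($sql)) {
    # My DBMS only knows Latin-1, so check that the query is representable.
    # FB_CROAK makes encode() die on unmappable characters instead of
    # silently substituting '?'; LEAVE_SRC keeps $sql itself untouched.
    my $latin1 = eval {
        Encode::encode('iso-8859-1', $sql, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    $dbh->set_error(-1, 'Unsupported characters in query.')
        unless defined $latin1;
}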

(the above is likely only reliable on Perl 5.8+)

> 
> Comments welcome, of course, but please stick to practical issues,
> ideally with examples, rather than theoretical ones. Thanks.
> 
> Tim.

And maybe an addition to the driver writer's POD to encourage UTF8
encoding ?

Hopefully this wasn't just a theoretical discussion...

Dean Arnold
Presicient Corp.
www.presicient.com
