> > Is there a consistent charset encoding behavior defined for
> > DBI at this time ?
>
> No.
>
> > If not, is a rule wrt charset encoding behavior needed ?
>
> Yes.
>
> > If a list of charset behaviors for each DBD is needed,
> > I'd be happy to put one together, assuming the DBD authors
> > send me the details for each driver.
>
> That would be great.

OK. Shall we start w/ DBD::Oracle ? ;^) And driver authors,
feel free to forward to me (and/or this list). I'll try
to put together a little webpage with the info.

> 1. Most applications only work with one character set encoding
>    (not counting UTF8). Obvious example: Latin-1.

Agreed.

> 2. Unicode is where we're going. Get used to it.

Agreed.

> 3. I don't really want the DBI to be involved in any recoding
>    of character sets (from client charset to server charset)
>    and I suggest that the drivers don't try to do that either.

OK. It could certainly be a performance killer.

> 4. DBI v2 will provide hooks to allow callbacks to be fired
>    on fetching a field and/or row and that could be used by an
>    application for recoding if it wants to 'hide' it under the DBI.
>
> 5. When selecting data from the database the driver should:
>    - return strings which have a unicode character set as UTF8.
>    - return strings with other character sets as-is (unchanged) on
>      the presumption that the application knows what to do with it.

The problem is that there's no standard metadata currently defined
to tell the app what the encoding is, and (AFAIK) Perl only has an
is_utf8() function that can be tested on any string independently.

E.g., let's say I'm writing a database IDE with the new Perl/Tk that
handles UNICODE. I want my tool to be dbms/driver independent (as much
as possible). I'm retrieving data to stuff into a fancy spreadsheet,
and I want ptk to use UNICODE so I'm I18N'd. If the dbms/driver returns
UTF8, that's all great. If it doesn't, or it returns things with
multiple encodings (ugh), I don't have any (standard) metadata to tell
me how to force the string into utf8 via Encode::from_to(). Or for that
matter, if drivers aren't tagging the returned strings as UTF8, I don't
have any idea what I'm dealing with.

I'm not expecting every driver, or DBI, to normalize everything;
rather, just a piece of info to tell an app what encoding the data
is in.
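
To make that limitation concrete, here is a minimal sketch that uses only
the core Encode module (Perl 5.8+). The sample strings are made up for
illustration and no driver is involved:

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# The same text ("cafe" with an acute accent on the e) as it might
# come back from three different drivers. Sample data only.
my $latin1_bytes = "caf\xE9";                      # ISO-8859-1 octets
my $utf8_bytes   = "caf\xC3\xA9";                  # UTF-8 octets, untagged
my $utf8_chars   = decode('UTF-8', $utf8_bytes);   # decoded, UTF8 flag on

for my $item (
    [ 'latin1 octets', $latin1_bytes ],
    [ 'utf8 octets',   $utf8_bytes   ],
    [ 'utf8 decoded',  $utf8_chars   ],
) {
    my ($label, $str) = @$item;
    # is_utf8() reports only whether the internal UTF8 flag is set;
    # it says nothing about which encoding untagged octets are in.
    printf "%-14s is_utf8=%d length=%d\n",
        $label, (is_utf8($str) ? 1 : 0), length($str);
}
```

The first two strings are indistinguishable as far as the flag goes, so
without some metadata from the driver the app can only guess.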

Presumably, just another bit of $sth metadata, e.g., $sth->{CHAR_SET},
to provide the info. If the driver doesn't know, then it fills in with
undef, and the app is on its own. Otherwise, the app has enough info
to make the necessary conversion:

    foreach (1..@{$sth->{TYPE}}) {
        Encode::from_to($row->[$_-1], $sth->{CHAR_SET}->[$_-1], 'utf8')
            if (defined($sth->{CHAR_SET}->[$_-1]) &&
                ($sth->{CHAR_SET}->[$_-1] ne 'utf8'));
    }

As for parameter data, there might be an optional CHAR_SET attribute
provided, or perhaps each driver can specify its "preferred" encoding,
and the app can coerce its data into that encoding as needed.

(I think the PerlIO encoding mechanisms provide some direction, or at
least have to deal with similar issues:
http://www.perldoc.com/perl5.8.0/pod/perluniintro.html#Unicode-I-O.
Come to think of it, it might be of interest to those using DBD::CSV
or other file-based DBDs ?)

Actually, this all may be a Perl problem more than a DBI issue.
E.g., in Java, String objects always have an associated character set,
but all Perl appears to have is "it's UTF8" or "I don't know what it
is except a bunch of bytes".

> 6. Drivers that want to can offer a mechanism to recode non-unicode
>    character sets into unicode but I don't see a big need for the
>    DBI to standardize an interface for that at the moment.
>
> 7. DBI v2 will probably provide a way for applications to force the
>    UTF8 flag on particular columns as a workaround for drivers that
>    don't know the string of bytes they're returning is actually UTF8.
>
> 8. When passing data to the database (including the SQL statement)
>    the driver should (perhaps) warn if it's presented with UTF8
>    strings but the database or database can't handle unicode.

Qualified warning: attempt to convert to (e.g.)
Latin1 first before throwing the exception:

    if (Encode::is_utf8($sql)) {
        #
        # my dbms only knows latin1, so check if compatible;
        # FB_CROAK makes encode() die on unmappable characters
        # (from_to() would silently substitute '?' by default)
        #
        $dbh->set_err(-1, 'Unsupported characters in query.')
            unless eval {
                Encode::encode('iso-8859-1', $sql,
                    Encode::FB_CROAK | Encode::LEAVE_SRC);
                1;
            };
    }

(the above is likely only reliable on Perl 5.8+)

> Comments welcome, of course, but please stick to practical issues,
> ideally with examples, rather than theoretical ones. Thanks.
>
> Tim.

And maybe an addition to the driver writer's POD to
encourage UTF8 encoding ?

Hopefully this wasn't just a theoretical discussion...

Dean Arnold
Presicient Corp.
www.presicient.com