Re: DBI and character sets (yet again)

Tim Bunce Mon, 22 Mar 2004 03:08:17 -0800

On Sun, Mar 21, 2004 at 04:50:34PM -0800, Dean Arnold wrote:
> > 
> > > If a list of charset behaviors for each DBD is needed,
> > > I'd be happy to put one together, assuming the DBD authors
> > > send me the details for each driver.
> > 
> > That would be great.
> 
> OK. Shall we start w/ DBD::Oracle ? ;^)


You could, but that's very much a moving target at the moment.

> And driver authors, feel free to forward to me (and/or this
> list). I'll try to put together a little webpage with the info.

I think it would help if you formulated a set of questions for driver
authors (or anyone else) to answer. Especially as finding the right
questions can be harder than finding the answers.

Here are a few to get you started:

 - Does the database:
        - have any concept of national character sets?
        - at what levels: database, table, field?
        - url for list of character set names?
        - does it support unicode?
 - Does the database client API:
        - provide access to character set information, and how?
        - at what levels: database, table, field?
        - does it have a concept of a client character set?
        - how is the client charset determined (locale, env var etc)
        - does it perform charset recoding?
 - Does the DBD driver:
        - (repeat last set of questions)


> > 3. I don't really want the DBI to be involved in any recoding
> >    of character sets (from client charset to server charset)
> >    and I suggest that the drivers don't try to do that either.
> 
> OK. It could certainly be a performance killer.

It's not a performance issue. Recoding would do nothing if the client
charset the app wants is the same as the server charset. If it's not,
then something has to do the recoding somewhere. At this point I don't
see a need for the DBI to do that itself - though it should provide
hooks that can help.
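
To make the "does nothing if the charsets match" point concrete, here
is a rough sketch using the core Encode module. The helper name
recode_if_needed() is made up for illustration; it is not part of the
DBI or of any driver, and it assumes both charset names are ones that
Encode recognises.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not a DBI or driver API): recode a string of
# octets from one charset to another, skipping the work entirely when
# both names resolve to the same Encode encoding.
sub recode_if_needed {
    my ($octets, $from, $to) = @_;

    my $from_enc = Encode::find_encoding($from)
        or die "Unknown charset: $from";
    my $to_enc = Encode::find_encoding($to)
        or die "Unknown charset: $to";

    # Same canonical encoding: nothing to do, return the data unchanged.
    return $octets if $from_enc->name eq $to_enc->name;

    # from_to() converts in place, so work on a copy.
    my $copy = $octets;
    Encode::from_to($copy, $from_enc->name, $to_enc->name);
    return $copy;
}
```

Here "latin1" and "iso-8859-1" resolve to the same encoding, so that
call returns its input untouched, while a latin1 to UTF-8 call goes
through Encode::from_to().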

> > 5. When selecting data from the database the driver should:
> >    - return strings which have a unicode character set as UTF8.
> >    - return strings with other character sets as-is (unchanged) on
> >      the presumption that the application knows what to do with it.
> 
> The problem is that there's no standard metadata currently defined
> to provide what the encoding is, and (AFAIK) Perl only has a 
> "is_utf8()" method that can be tested on any string independently.

It's a driver-private issue. If a column is unicode then (I think)
the driver should set $sth->{TYPE}->[...] to the appropriate 'wide'
type (SQL_WCHAR, SQL_WVARCHAR, etc).
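
As a sketch of how an application might use that, assuming the driver
really does report wide types in $sth->{TYPE} (many may not): the
SQL_W* constants come from the DBI's :sql_types export tag, and the
function name wide_columns() is invented for this example.

```perl
use strict;
use warnings;
use DBI qw(:sql_types);

# The 'wide' (unicode) character types exported by the DBI.
my %is_wide = map { $_ => 1 } (SQL_WCHAR, SQL_WVARCHAR, SQL_WLONGVARCHAR);

# Hypothetical helper: given an array ref of type codes, such as the
# value of $sth->{TYPE}, return an array ref of the column indexes
# that the driver reports as wide character types.
sub wide_columns {
    my ($types) = @_;
    return [
        grep { defined $types->[$_] && $is_wide{ $types->[$_] } }
            0 .. $#{$types}
    ];
}
```

An application would call wide_columns($sth->{TYPE}) after prepare or
execute and treat only those columns as unicode.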

> I'm not expecting every driver, or DBI, to normalize everything; rather, just
> a piece of info to tell an app what encoding the data is in.

It boils down to unicode or not. And if not then you (currently)
have to assume that it's the same charset as the client because
99.8745% of the time it will be.

> Presumably,
> just another bit of $sth metadata, e.g., $sth->{CHAR_SET}, to provide
> the info. If the driver doesn't know, then it fills in with undef, and
> the app is on its own. Otherwise, the app has enough info to make
> the necessary conversion:

You're presuming that all databases that support charsets will use
the same set of names as Encode uses. I hope that is the case but
it might not be. (Add that to your list of things to discover :)
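
A sketch of what the name problem looks like in practice, assuming an
app-maintained mapping table. The two Oracle-style entries are only
illustrative, the table is nowhere near complete, and
encode_name_for() is a made-up name rather than anything in the DBI.

```perl
use strict;
use warnings;
use Encode ();

# Illustrative only: database charset names that Encode does not know,
# mapped by hand to names that it does.
my %db_to_encode = (
    WE8ISO8859P1 => 'iso-8859-1',
    AL32UTF8     => 'UTF-8',
);

# Hypothetical helper: return the canonical Encode name for a database
# charset name, or undef if Encode has no matching encoding.
sub encode_name_for {
    my ($db_charset) = @_;
    return undef unless defined $db_charset;

    my $key       = uc $db_charset;
    my $candidate = exists $db_to_encode{$key}
        ? $db_to_encode{$key}
        : $db_charset;

    my $enc = Encode::find_encoding($candidate);
    return $enc ? $enc->name : undef;
}
```

Names that Encode already understands (directly or via its aliases)
pass straight through; anything else needs a table entry, and an
undef result tells the app it is on its own.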

> As for parameter data, there might be an optional CHAR_SET attribute provided,
> or perhaps each driver can specify its "preferred" encoding, and the app
> can coerce its data into that encoding as needed.

The "preferred" encoding could (should?) be unicode.

> [...]
> (the above is likely only reliable on Perl 5.8+)

There is still a regular stream of unicode-related bugs being found
and fixed in perl.  Anyone doing much work with unicode should be
using 5.8.3.

> > Comments welcome, of course, but please stick to practical issues,
> > ideally with examples, rather than theoretical ones. Thanks.
> > 
> > Tim.
> 
> And maybe an addition to the driver writer's POD to encourage UTF8
> encoding ?
> 
> Hopefully this wasn't just a theoretical discussion...

Only a little :) My point is that as engineers we want a beautiful
system that'll automatically and transparently recode between
multiple client and server charsets. But in practice very few people
need that (see points 1 and 2).

Some features, like charsets, vary greatly in how they're handled
by database APIs.  For these kinds of features the DBI usually lags
the drivers. Once a few drivers have implemented their own driver-specific
interfaces, and had them proven as practical by users, *then* I
can work with driver authors to see how best to extend the DBI API
in a way that'll work well for those drivers and others.

That's what happened for $sth->execute_array, and that's exactly
what's happening with $sth->more_results at the moment.

The driver survey would be a valuable step along this road.

Tim.
