On Mon, Mar 22, 2004 at 03:08:55PM -0800, Dean Arnold wrote:
> > I think it would help if you formulated a set of questions for driver
> > authors (or anyone else) to answer. Especially as finding the right
> > questions can be harder than finding the answers.
> >
> > Here are a few to get you started:
> >
> > - Does the database:
> >   - have any concept of national character sets?
> >   - at what levels: database, table, field?
> >   - url for list of character set names?
> >   - does it support unicode?
> > - Does the database client API:
> >   - provide access to character set information, and how?
> >   - at what levels: database, table, field?
> >   - does it have a concept of a client character set?
> >   - how is the client charset determined (locale, env var etc)?
> >   - does it perform charset recoding?
> > - Does the DBD driver:
> >   - (repeat last set of questions)
>
> Er, that's a pretty long list of detailed info, and I'm not certain it's
> all entirely needed.
Better to ask for more detail up front or I suspect we'll be left without
sufficient information to guess the answers to the questions we didn't
think of :)

And here's one: Give a URL to (the relevant part of) the database API
specification.

> > It's a driver-private issue. If a column is unicode then (I think)
> > the driver should set $sth->{TYPE}->[...] to the appropriate 'wide'
> > type (SQL_WCHAR, SQL_WVARCHAR, etc).
>
> Problem is, that's still a pretty high level. SQL_WCHAR can be UTF8,
> UTF16, UTF32, UCS2, etc. And we don't have a standard way of indicating
> which it is at present, and we've previously agreed that we're not
> going to force DBDs to normalize on anything.

Maybe I haven't said this before, but from the DBI's perspective all the
wide char types imply perl's native unicode, i.e. utf8.

> Are we relying on locale to determine which UNICODE encoding
> the data is in those cases?

The database API should always make it clear. (I think it's typically
UCS2.)

> E.g., my locale's charset is UTF8, and I retrieve some UNICODE columns.
> The DBD returns UTF16, but doesn't have a std. means of telling the
> app that. Is the implicit assumption that either the a) DBD, b) client
> libs, or c) database must figure out how to locale-ize the results?

Converting UTF16 or UCS2 to UTF8 doesn't require any figuring out. The
locale isn't involved at all. It's a lossless conversion (so long as you
know what you're converting from, which the database API will specify).

> For that matter, if my locale's charset is UTF8 and the DBD returns
> some latin3 columns, who is responsible for getting them into the
> locale's charset? At present, the app can't (except using
> driver-specific interfaces, if they exist), since it doesn't know that
> the columns are latin3, only that they're SQL_CHAR.

Currently it's the app, possibly using driver-specific interfaces.
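[Editorial aside: a minimal sketch of the "lossless, locale-free" point
above, using only the core Encode module. The UCS-2BE byte string stands
in for wide data as a database API might hand it to a driver; the name of
the source encoding is all that's needed.]

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# "Ab<U+263A>" (A, b, WHITE SMILING FACE) as UCS-2 big-endian octets,
# i.e. wide data as a database API might deliver it.
my $ucs2_bytes = "\x00\x41\x00\x62\x26\x3A";

# Decoding needs only the source encoding's name -- the locale is never
# consulted, and the conversion to perl's native (utf8) strings is
# lossless.
my $string = decode('UCS-2BE', $ucs2_bytes);

printf "length=%d, ords=%vd\n", length($string), $string;
# prints: length=3, ords=65.98.9786
```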
> > > As for parameter data, there might be an optional CHAR_SET
> > > attribute provided, or perhaps each driver can specify its
> > > "preferred" encoding, and the app can coerce its data into that
> > > encoding as needed.
> >
> > The "preferred" encoding could (should?) be unicode.
>
> So the app specifies SQL_WCHAR... but should the DBD then assume it's
> UTF8 encoded?

The data from perl to the driver can be assumed to be utf8 if the app
has bound the column as SQL_WCHAR or SQL_WVARCHAR etc. Wide data from
the database API 'up' into the driver is whatever the database API says
it will be (typically UCS2).

> Maybe it boils down to a set of rules:
>
> 1. DBDs should communicate the available locale information
>    to the datasource (which includes everything below the DBD:
>    client libs and database)
>
> 2. The datasource is responsible for returning data in the client's
>    locale-defined charset.

Those first two normally happen by default in most database APIs.

> 3. The DBD assumes all data provided by an application is in
>    the locale-defined charset.
>
> 4. If the datasource cannot return or accept data in the client's
>    locale-defined charset, then ???
>
> If we can nail down (4), even if it's just "throw an error on
> connection", then I think we've got your 99.8745% covered.

Sadly not. All the above doesn't address how to get unicode _and_ one
other charset to coexist. That's the main goal I'm after as that's
what's needed to support migration towards unicode.

> I've ranted and braindumped enough for one day, so flame away,

Take a deep breath. Gather the driver information (formulate the
questions, email to driver authors, tabulate the results) and then we'll
take stock of where we're at.

Thanks!

Tim.
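[Editorial aside: a minimal sketch of the parameter-coercion idea
discussed above. How a driver would announce its "preferred" encoding
was left open in the thread; the $preferred variable here is a
hypothetical stand-in for whatever driver-private attribute would carry
it. Only the core Encode module is used.]

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical: the driver has announced a "preferred" wide encoding
# (mechanism unspecified in the thread); UCS-2BE stands in here.
my $preferred = 'UCS-2BE';

my $param  = "caf\x{E9}";                 # perl's native (utf8) string
my $octets = encode($preferred, $param);  # coerce to driver's encoding

# Each character becomes exactly two octets in UCS-2.
printf "%d chars -> %d octets\n", length($param), length($octets);
# prints: 4 chars -> 8 octets
```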