On Mon, Mar 22, 2004 at 03:08:55PM -0800, Dean Arnold wrote:
> >
> > I think it would help if you formulated a set of questions for driver
> > authors (or anyone else) to answer. Especially as finding the right
> > questions can be harder than finding the answers.
> >
> > Here are a few to get you started:
> >
> >  - Does the database:
> > - have any concept of national character sets?
> > - at what levels: database, table, field?
> > - url for list of character set names?
> > - does it support unicode?
> >  - Does the database client API:
> > - provide access to character set information, and how?
> > - at what levels: database, table, field?
> > - does it have a concept of a client character set?
> > - how is the client charset determined (locale, env var etc)
> > - does it perform charset recoding?
> >  - Does the DBD driver:
> > - (repeat last set of questions)
> 
> Er, that's a pretty long list of detailed info, and I'm not certain it's all
> entirely needed.

Better to ask for more detail up front, or I suspect we'll be left
without sufficient information to guess the answers to the questions
we didn't think of :)

And here's one: Give a URL to (the relevant part of) the database API specification.


> > It's a driver-private issue. If a column is unicode then (I think)
> > the driver should set $sth->{TYPE}->[...] to the appropriate 'wide'
> > type (SQL_WCHAR, SQL_WVARCHAR, etc).
> 
> Problem is, that's still a pretty high level. SQL_WCHAR can be UTF8, UTF16, UTF32,
> UCS2, etc. We don't have a standard way of indicating which it is at present, and
> we've previously agreed that we're not going to force DBDs to normalize on anything.

Maybe I haven't said this before, but from the DBI's perspective
all the wide char types imply Perl's native Unicode representation, i.e. UTF-8.

> Are we relying on locale to determine which UNICODE encoding
> the data is in those cases ?

The database API should always make it clear. (I think it's typically UCS2.)

> E.g., my locale's charset is UTF8, and I retrieve some UNICODE columns. The DBD
> returns UTF16, but doesn't have a std. means of telling the app that. Is the
> implicit assumption that either the a) DBD, b) client libs, or c) database must
> figure out how to locale-ize the results?

Converting UTF16 or UCS2 to UTF8 doesn't require any figuring out.
The locale isn't involved at all. It's a lossless conversion (so long
as you know what you're converting from, which the database API will specify).
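To make that concrete, here's a minimal sketch (in Python rather than Perl, purely for illustration) of why no locale lookup is needed: re-encoding from UTF-16/UCS-2 to UTF-8 is a mechanical transformation once the source encoding is known.

```python
# Converting UTF-16 (or UCS-2, its BMP-only subset) to UTF-8 is a
# lossless, locale-independent re-encoding. The only thing you must
# know is the source encoding -- which the database API specifies.
text = "caf\u00e9 \u20ac"                      # 'café €', all BMP characters

utf16_bytes = text.encode("utf-16-le")         # bytes as a database API might return them
utf8_bytes = utf16_bytes.decode("utf-16-le").encode("utf-8")

# Round-trip proves no information was lost along the way.
roundtrip = utf8_bytes.decode("utf-8")
assert roundtrip == text
```

The same holds in Perl via the Encode module (`decode('UTF-16LE', ...)` then `encode_utf8`); the point is that the locale never enters into it.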

> For that matter, if my locale's charset is UTF8 and the DBD returns some latin3
> columns, who is responsible for getting them into the locale's charset? At present,
> the app can't (except using driver-specific i/fs, if they exist), since it doesn't
> know that the columns are latin3, only that they're SQL_CHAR.

Currently it's the app, possibly using driver-specific interfaces.
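A quick sketch of what that looks like from the app's side (Python for illustration; in Perl you'd use the Encode module). The key assumption, as noted above, is that the app has learned out of band that the column bytes are Latin-3:

```python
# The driver hands back raw SQL_CHAR bytes; the app must know
# (from a driver-specific interface, or documentation) that the
# column is Latin-3 (ISO-8859-3) before it can decode correctly.
raw = b"\xf8is"                        # Latin-3 bytes for the Esperanto word 'ĝis'

decoded = raw.decode("iso-8859-3")     # app supplies the charset knowledge itself
assert decoded == "\u011dis"           # 'ĝis' as a proper Unicode string
```

If the DBI exposed the column charset in metadata, this out-of-band knowledge wouldn't be needed; that's the gap under discussion.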

> > > As for parameter data, there might be an optional CHAR_SET attribute provided,
> > > or perhaps each driver can specify its "preferred" encoding, and the app
> > > can coerce its data into that encoding as needed.
> >
> > The "preferred" encoding could (should?) be unicode.
> 
> So the app specifies SQL_WCHAR... but should the DBD then assume it's
> UTF8 encoded?

The data from Perl to the driver can be assumed to be UTF-8 if the
app has bound the column as SQL_WCHAR or SQL_WVARCHAR etc.
Wide data from the database API 'up' into the driver is whatever
the database API says it will be (typically UCS2).
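A hypothetical sketch of that driver-side rule (Python for illustration; the UTF-16LE wire format and the use of ODBC's SQL_WCHAR value are my assumptions, not settled DBI behaviour):

```python
# Sketch of the proposed rule: parameter data bound as a wide type is
# assumed to be UTF-8 on the Perl side, and the driver re-encodes it
# to whatever the database client API expects (UCS-2/UTF-16LE here,
# an assumption for this example).
SQL_WCHAR = -8  # ODBC's value for SQL_WCHAR, used only as an illustrative constant

def to_wire(param_bytes: bytes, bound_type: int) -> bytes:
    """Re-encode a bound parameter for the (hypothetical) client API."""
    if bound_type == SQL_WCHAR:
        # Wide binding: bytes are UTF-8 from Perl; API wants UTF-16LE.
        return param_bytes.decode("utf-8").encode("utf-16-le")
    # Narrow types pass through unchanged; charset handling is elsewhere.
    return param_bytes

wire = to_wire("\u00e9".encode("utf-8"), SQL_WCHAR)
assert wire == b"\xe9\x00"
```

The symmetric rule applies on the way out: wide results arrive in the API's declared encoding and the driver converts them to UTF-8 before handing them to Perl.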


> Maybe it boils down to a set of rules:
> 
> 1. DBDs should communicate the available locale information
> to the datasource (which includes everything below the DBD:
> client libs and database)
> 
> 2. The datasource is responsible for returning data in the client's
> locale defined charset.

Those first two normally happen by default in most database APIs.

> 3. The DBD assumes all data provided by an application is in
> the locale defined charset.
> 
> 4. If the datasource cannot return or accept data in the client's locale
> defined charset, then ???
> 
> If we can nail down (4), even if it's just "throw an error on connection",
> then I think we've got your 99.8745% covered.

Sadly not. None of the above addresses how to get Unicode _and_ one
other charset to coexist. That's the main goal I'm after, as it's
what's needed to support migration towards Unicode.

> I've ranted and braindumped enough for one day, so flame away,

Take a deep breath. Gather the driver information (formulate the
questions, email to driver authors, tabulate the results) and then
we'll take stock of where we're at.

Thanks!

Tim.
