On Mon, Mar 22, 2004 at 03:08:55PM -0800, Dean Arnold wrote:
> >
> > I think it would help if you formulated a set of questions for driver
> > authors (or anyone else) to answer. Especially as finding the right
> > questions can be harder than finding the answers.
> >
> > Here are a few to get you started:
> >
> > - Does the database:
> > - have any concept of national character sets?
> > - at what levels: database, table, field?
> > - url for list of character set names?
> > - does it support unicode?
> > - Does the database client API:
> > - provide access to character set information, and how?
> > - at what levels: database, table, field?
> > - does it have a concept of a client character set?
> > - how is the client charset determined (locale, env var etc)
> > - does it perform charset recoding?
> > - Does the DBD driver:
> > - (repeat last set of questions)
>
> Er, that's a pretty long list of detailed info, and I'm not certain it's
> all entirely needed.
Better to ask for more detail up front, or I suspect we'll be left
without sufficient information to guess the answers to the questions
we didn't think of :)
And here's one: Give a URL to (the relevant part of) the database API specification.
> > It's a driver-private issue. If a column is unicode then (I think)
> > the driver should set $sth->{TYPE}->[...] to the appropriate 'wide'
> > type (SQL_WCHAR, SQL_WVARCHAR, etc).
>
> Problem is, that's still a pretty high level. SQL_WCHAR can be UTF8, UTF16,
> UTF32, UCS2, etc. And we don't have a standard way of indicating which it is
> at present, and we've previously agreed that we're not going to force DBDs
> to normalize on anything.
Maybe I haven't said this before, but from the DBI's perspective
all the wide char types imply perl's native unicode, i.e. utf8.
> Are we relying on locale to determine which UNICODE encoding
> the data is in those cases ?
The database API should always make it clear. (I think it's typically UCS2.)
> E.g., my locale's charset is UTF8, and I retrieve some UNICODE columns.
> The DBD returns UTF16, but doesn't have a std. means of telling the app
> that. Is the implicit assumption that either the a) DBD, b) client libs,
> or c) database must figure out how to locale-ize the results ?
Converting UTF16 or UCS2 to UTF8 doesn't require any figuring out.
The locale isn't involved at all. It's a lossless conversion (so long
as you know what you're converting from, which the database API will specify).
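To make that concrete, here's a minimal sketch using the core Encode
module (the UCS-2BE sample octets are made up for illustration). No
locale is consulted anywhere; the only thing needed is the name of the
source encoding, which is exactly what the database API should specify:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Pretend these octets came from the database API as UCS-2BE:
# U+0048 'H', U+0069 'i', U+20AC euro sign.
my $ucs2 = "\x00H\x00i\x20\xac";

my $chars = decode('UCS-2BE', $ucs2);   # perl's native (utf8) string
my $utf8  = encode('UTF-8', $chars);    # utf8 octets for the app

# Lossless: converting back recovers the original octets exactly.
my $again = encode('UCS-2BE', decode('UTF-8', $utf8));
print $again eq $ucs2 ? "lossless\n" : "lossy\n";
```

The round trip succeeds for any valid input because UCS2/UTF16 and UTF8
are just different encodings of the same code points.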
> For that matter, if my locale's charset is UTF8 and the DBD returns some
> latin3 columns, who is responsible for getting them into the locale's
> charset ? At present, the app can't (except using driver-specific i/fs,
> if they exist), since it doesn't know that the columns are latin3, only
> that they're SQL_CHAR.
Currently it's the app, possibly using driver-specific interfaces.
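In other words the app has to learn the column encoding out-of-band
(driver docs, a driver-private attribute, etc.) and then recode the
octets itself. A sketch with core Encode, assuming the app has somehow
discovered that the column is latin3 (the sample octets are invented):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Octets as fetched through the driver; knowing they are latin3
# (ISO-8859-3) is the out-of-band part the DBI can't yet express.
my $raw = "E\xFDropo";   # 0xFD is a latin3-specific letter (u-breve, I believe)

my $str = decode('iso-8859-3', $raw);   # now perl's native utf8 string

printf "%d chars, first non-ASCII at U+%04X\n",
    length($str), ord(substr($str, 1, 1));
```

Once decoded into perl's native form, re-encoding to the locale's
charset (here UTF8) is the easy part.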
> > > As for parameter data, there might be an optional CHAR_SET attribute provided,
> > > or perhaps each driver can specify its "preferred" encoding, and the app
> > > can coerce its data into that encoding as needed.
> >
> > The "preferred" encoding could (should?) be unicode.
>
> So the app specifies SQL_WCHAR...but should the DBD then assume it's
> UTF8 encoded ?
The data from perl to the driver can be assumed to be utf8 if the
app has bound the column as SQL_WCHAR or SQL_WVARCHAR etc.
Wide data from the database API 'up' into the driver is whatever
the database API says it will be (typically UCS2).
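Concretely, "assume utf8" on the perl side means the bound value is a
perl character string, which carries perl's internal utf8 flag that a
driver could inspect. A sketch (the flag check is illustrative; how a
given DBD actually uses it is up to that driver):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "caf\xe9";                     # raw latin1 octets, flag off
my $chars = decode('ISO-8859-1', $bytes);  # perl characters, flag on

# A driver could distinguish the two before sending data 'down':
printf "bytes flagged: %d\n", utf8::is_utf8($bytes) ? 1 : 0;   # 0
printf "chars flagged: %d\n", utf8::is_utf8($chars) ? 1 : 0;   # 1
```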
> Maybe it boils down to a set of rules:
>
> 1. DBDs should communicate the available locale information
> to the datasource (which includes everything below the DBD:
> client libs and database)
>
> 2. The datasource is responsible for returning data in the client's
> locale defined charset.
Those first two normally happen by default in most database APIs.
> 3. The DBD assumes all data provided by an application is in
> the locale defined charset.
>
> 4. If the datasource cannot return or accept data in the client's locale
> defined charset, then ???
>
> If we can nail down (4), even if its just "throw an error on connection",
> then I think we've got your 99.8745% covered.
Sadly not. None of the above addresses how to get unicode _and_ one
other charset to coexist. That's the main goal I'm after, as that's
what's needed to support migration towards unicode.
> I've ranted and braindumped enough for one day, so flame away,
Take a deep breath. Gather the driver information (formulate the
questions, email to driver authors, tabulate the results) and then
we'll take stock of where we're at.
Thanks!
Tim.