On Mon, Mar 22, 2004 at 03:08:55PM -0800, Dean Arnold wrote:
> > I think it would help if you formulated a set of questions for driver
> > authors (or anyone else) to answer. Especially as finding the right
> > questions can be harder than finding the answers.
> >
> > Here are a few to get you started:
> >
> > - Does the database:
> >   - have any concept of national character sets?
> >   - at what levels: database, table, field?
> >   - url for list of character set names?
> >   - does it support unicode?
> > - Does the database client API:
> >   - provide access to character set information, and how?
> >   - at what levels: database, table, field?
> >   - does it have a concept of a client character set?
> >   - how is the client charset determined (locale, env var etc)?
> >   - does it perform charset recoding?
> > - Does the DBD driver:
> >   - (repeat last set of questions)
>
> Er, that's a pretty long list of detailed info, and I'm not certain it's
> all entirely needed.
Better to ask for more detail up front or I suspect we'll be left without
sufficient information to guess the answers to the questions we didn't
think of :)

And here's one: Give a URL to (the relevant part of) the database API
specification.

> > It's a driver-private issue. If a column is unicode then (I think)
> > the driver should set $sth->{TYPE}->[...] to the appropriate 'wide'
> > type (SQL_WCHAR, SQL_WVARCHAR, etc).
>
> Problem is, that's still a pretty high level. SQL_WCHAR can be UTF8,
> UTF16, UTF32, UCS2, etc. And we don't have a standard way of indicating
> which it is at present, and we've previously agreed that we're not
> going to force DBDs to normalize on anything.

Maybe I haven't said this before, but from the DBI's perspective all the
wide char types imply perl's native unicode, i.e. utf8.

> Are we relying on locale to determine which UNICODE encoding
> the data is in those cases?

The database API should always make it clear. (I think it's typically
UCS2.)

> E.g., my locale's charset is UTF8, and I retrieve some UNICODE columns.
> The DBD returns UTF16, but doesn't have a std. means of telling the
> app that. Is the implicit assumption that either the a) DBD, b) client
> libs, or c) database must figure out how to locale-ize the results?

Converting UTF16 or UCS2 to UTF8 doesn't require any figuring out. The
locale isn't involved at all. It's a lossless conversion (so long as you
know what you're converting from, which the database API will specify).

> For that matter, if my locale's charset is UTF8 and the DBD returns
> some latin3 columns, who is responsible for getting them into the
> locale's charset? At present, the app can't (except using
> driver-specific interfaces, if they exist), since it doesn't know that
> the columns are latin3, only that they're SQL_CHAR.

Currently it's the app, possibly using driver-specific interfaces.
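[Editorial aside: a minimal sketch of the "lossless, locale-free" point
above, using only the core Encode module. The UCS-2BE byte string stands
in for wide data as a database API might hand it to a driver; the name of
the source encoding is all that's needed.]

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# "Ab<U+263A>" (A, b, WHITE SMILING FACE) as UCS-2 big-endian octets,
# i.e. wide data as a database API might deliver it.
my $ucs2_bytes = "\x00\x41\x00\x62\x26\x3A";

# Decoding needs only the source encoding's name -- the locale is never
# consulted, and the conversion to perl's native (utf8) strings is
# lossless.
my $string = decode('UCS-2BE', $ucs2_bytes);

printf "length=%d, ords=%vd\n", length($string), $string;
# prints: length=3, ords=65.98.9786
```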
> > > As for parameter data, there might be an optional CHAR_SET
> > > attribute provided, or perhaps each driver can specify its
> > > "preferred" encoding, and the app can coerce its data into that
> > > encoding as needed.
> >
> > The "preferred" encoding could (should?) be unicode.
>
> So the app specifies SQL_WCHAR... but should the DBD then assume it's
> UTF8 encoded?

The data from perl to the driver can be assumed to be utf8 if the app
has bound the column as SQL_WCHAR or SQL_WVARCHAR etc. Wide data from
the database API 'up' into the driver is whatever the database API says
it will be (typically UCS2).

> Maybe it boils down to a set of rules:
>
> 1. DBDs should communicate the available locale information
>    to the datasource (which includes everything below the DBD:
>    client libs and database)
>
> 2. The datasource is responsible for returning data in the client's
>    locale-defined charset.

Those first two normally happen by default in most database APIs.

> 3. The DBD assumes all data provided by an application is in
>    the locale-defined charset.
>
> 4. If the datasource cannot return or accept data in the client's
>    locale-defined charset, then ???
>
> If we can nail down (4), even if it's just "throw an error on
> connection", then I think we've got your 99.8745% covered.

Sadly not. All the above doesn't address how to get unicode _and_ one
other charset to coexist. That's the main goal I'm after as that's
what's needed to support migration towards unicode.

> I've ranted and braindumped enough for one day, so flame away,

Take a deep breath. Gather the driver information (formulate the
questions, email to driver authors, tabulate the results) and then we'll
take stock of where we're at.

Thanks!

Tim.
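[Editorial aside: a minimal sketch of the parameter-coercion idea
discussed above. How a driver would announce its "preferred" encoding
was left open in the thread; the $preferred variable here is a
hypothetical stand-in for whatever driver-private attribute would carry
it. Only the core Encode module is used.]

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical: the driver has announced a "preferred" wide encoding
# (mechanism unspecified in the thread); UCS-2BE stands in here.
my $preferred = 'UCS-2BE';

my $param  = "caf\x{E9}";                 # perl's native (utf8) string
my $octets = encode($preferred, $param);  # coerce to driver's encoding

# Each character becomes exactly two octets in UCS-2.
printf "%d chars -> %d octets\n", length($param), length($octets);
# prints: 4 chars -> 8 octets
```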