>
> I think it would help if you formulated a set of questions for driver
> authors (or anyone else) to answer. Especially as finding the right
> questions can be harder than finding the answers.
>
> Here are a few to get you started:
>
> - Does the database:
>     - have any concept of national character sets?
>         - at what levels: database, table, field?
>     - url for list of character set names?
>     - does it support unicode?
> - Does the database client API:
>     - provide access to character set information, and how?
>         - at what levels: database, table, field?
>     - does it have a concept of a client character set?
>     - how is the client charset determined (locale, env var etc)
>     - does it perform charset recoding?
> - Does the DBD driver:
>     - (repeat last set of questions)

Er, that's a pretty long list of detailed info, and I'm not certain
it's all entirely needed. How about:

1. Does your DBD currently support NLS encodings?

2. If so:
    a. what character sets does it support?
    b. how does it determine what character set to use for
       returned data, and for parameter data and SQL statements?
    c. does it provide any metadata to indicate the character set
       of returned columns?
    d. does it provide any mechanism for apps to indicate the
       character set of parameter data and/or SQL statements?
    e. does it perform any internal character set conversions in
       support of any of the above?

3. If not, do your target database and/or any supporting client
   access libraries support NLS capabilities (i.e., *could* you
   support NLS)?

I'm assuming at least *some* databases (including some of the more
popular ones) support different charsets for columns in the same
table, so whether some others do not doesn't (IMHO) really matter
(if we want to avoid an LCD implementation). A given DBD either

    - supports NLS
    - doesn't support NLS, but the database and client libs do
    - doesn't support NLS because the database and/or client
      libs don't

And of course, ODBC, JDBC, and ADO have only "fuzzy" answers to
the above.

>
> > > 5. When selecting data from the database the driver should:
> > > - return strings which have a unicode character set as UTF8.
> > > - return strings with other character sets as-is (unchanged) on
> > >   the presumption that the application knows what to do with it.
> >
> > The problem is that there's no standard metadata currently defined
> > to provide what the encoding is, and (AFAIK) Perl only has a
> > "is_utf8()" method that can be tested on any string independently.
>
> It's a driver-private issue. If a column is unicode then (I think)
> the driver should set $sth->{TYPE}->[...] to the appropriate 'wide'
> type (SQL_WCHAR, SQL_WVARCHAR, etc).

Problem is, that's still a pretty high level. SQL_WCHAR can be UTF8,
UTF16, UTF32, UCS2, etc.
And we don't have a standard way of indicating which it is at
present, and we've previously agreed that we're not going to force
DBDs to normalize on anything.

(I'm probably getting confused at this point, but bear with me...)

Are we relying on locale to determine which UNICODE encoding the
data is in, in those cases? E.g., my locale's charset is UTF8, and I
retrieve some UNICODE columns. The DBD returns UTF16, but doesn't
have a standard means of telling the app that. Is the implicit
assumption that either a) the DBD, b) the client libs, or c) the
database must figure out how to localize the results?

For that matter, if my locale's charset is UTF8 and the DBD returns
some latin3 columns, who is responsible for getting them into the
locale's charset? At present, the app can't (except by using
driver-specific interfaces, if they exist), since it doesn't know
that the columns are latin3, only that they're SQL_CHAR.

>
> > I'm not expecting every driver, or DBI, to normalize everything; rather, just
> > a piece of info to tell an app what encoding the data is in.
>
> It boils down to unicode or not. And if not then you (currently)
> have to assume that it's the same charset as the client because
> 99.8745% of the time it will be.

Which I assume is to be derived from the locale. OK.

>
> > Presumably,
> > just another bit of $sth metadata, e.g., $sth->{CHAR_SET}, to provide
> > the info. If the driver doesn't know, then it fills in with undef, and
> > the app is on its own. Otherwise, the app has enough info to make
> > the necessary conversion:
>
> You're presuming that all database that support charsets will use
> the same set of names as Encode uses. I hope that is the case but
> it might not be. (Add that to your list of things to discover :)

Or that DBDs will make the effort to map their database's encoding
names to their Perl equivalents, much as we assume they know how to
map whatever their client lib/database uses to indicate "this is an
integer" into SQL_INTEGER.
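
To make that mapping idea concrete, here is a rough sketch only. The
%DB_TO_ENCODE table, the database-side charset names in it, and
decode_column() are all hypothetical, and the charset name is assumed
to come from something like the $sth->{CHAR_SET} attribute proposed
above, which is not an existing DBI attribute. Only the Encode calls
(find_encoding and the encoding object's decode method) are real.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# HYPOTHETICAL table: a database's own charset names mapped to the
# names Perl's Encode module understands. A real DBD would have to
# supply its own table for its own database.
my %DB_TO_ENCODE = (
    LATIN3    => 'iso-8859-3',
    UNICODE16 => 'UTF-16LE',
    UTF8      => 'UTF-8',
);

# Decode raw column bytes into a Perl character string, given the
# charset name the driver reported. If the name is undef or not in
# the table, hand the bytes back unchanged: the app is on its own.
sub decode_column {
    my ($bytes, $db_charset) = @_;
    return $bytes unless defined $db_charset;
    my $name = $DB_TO_ENCODE{ uc $db_charset } or return $bytes;
    my $enc  = find_encoding($name)            or return $bytes;
    return $enc->decode($bytes);
}

# "A" stored as UTF-16LE is the two bytes 0x41 0x00.
my $text = decode_column("\x41\x00", 'unicode16');
print "decoded: $text\n";    # prints "decoded: A"

# Driver doesn't know the charset: bytes come back untouched.
my $raw = decode_column("\xA6", undef);
print "raw length: ", length($raw), "\n";    # prints "raw length: 1"
```

The point of the undef/unknown fallthrough is that a driver which
can't name the encoding does no harm; the app simply gets what it
gets today.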

>
> > As for parameter data, there might be an optional CHAR_SET attribute provided,
> > or perhaps each driver can specify its "preferred" encoding, and the app
> > can coerce its data into that encoding as needed.
>
> The "preferred" encoding could (should?) be unicode.

So the app specifies SQL_WCHAR...but should the DBD then assume it's
UTF8 encoded? Couldn't it be UTF16, UCS2, etc.? But there's no way
to tell the DBD that. And the DBD may need to use some encoding
other than UTF8.

Maybe it boils down to a set of rules:

1. DBDs should communicate the available locale information to the
   datasource (which includes everything below the DBD: client libs
   and database).

2. The datasource is responsible for returning data in the client's
   locale-defined charset.

3. The DBD assumes all data provided by an application is in the
   locale-defined charset.

4. If the datasource cannot return or accept data in the client's
   locale-defined charset, then ???

If we can nail down (4), even if it's just "throw an error on
connection", then I think we've got your 99.8745% covered.

BTW: I did some checking on setting/getting locale info (via
setlocale(LC_CTYPE)), and AS Perl on Win32 doesn't appear to behave
very well. It reports something, but won't permit a modification.
I tried AS 5.6 on WinXP and AS 5.8.3 on Win2K. Does anyone know how
to make that work (Google wasn't too helpful)? Fedora 1 w/ Perl
5.8.3 works fine. I don't know if that impacts our discussion or
not, but I'd hate to settle on a solution that precluded support
for the most common platform.

I've ranted and braindumped enough for one day, so flame away,

Dean Arnold
Presicient Corp.
www.presicient.com