Re: DBI and character sets (yet again)

Dean Arnold Mon, 22 Mar 2004 15:10:03 -0800

>
> I think it would help if you formulated a set of questions for driver
> authors (or anyone else) to answer. Especially as finding the right
> questions can be harder than finding the answers.
>
> Here are a few to get you started:
>
>  - Does the database:
>      - have any concept of national character sets?
>      - at what levels: database, table, field?
>      - url for list of character set names?
>      - does it support unicode?
>  - Does the database client API:
>      - provide access to character set information, and how?
>      - at what levels: database, table, field?
>      - does it have a concept of a client character set?
>      - how is the client charset determined (locale, env var etc)
>      - does it perform charset recoding?
>  - Does the DBD driver:
>      - (repeat last set of questions)


Er, that's a pretty long list of detailed info, and I'm not certain it's all entirely
needed.

How about:

1. Does your DBD currently support NLS encodings ?

2. If so
    a. what character sets does it support ?
    b. how does it determine what character set to use for returned data, and for
       parameter data and SQL statements ?
    c. does it provide any metadata to indicate the character set of returned columns ?
    d. does it provide any mechanism for apps to indicate the character set of
       parameter data and/or SQL statements ?
    e. does it perform any internal character set conversions in support of any of the
       above ?

3. If not, do your target database and/or any supporting client access libraries
    support NLS capabilities (ie, *could* you support NLS) ?

I'm assuming at least *some* databases (including some of the more popular ones)
support different charsets for columns in the same table, so (IMHO) it doesn't
really matter that some others do not, if we want to avoid a
lowest-common-denominator (LCD) implementation. In any case, a given DBD either

    - supports NLS
    - doesn't support NLS, but the database and client libs do
    - doesn't support NLS because the database and/or client libs don't

And of course, ODBC, JDBC, and ADO have only "fuzzy" answers to the above.

>
> > > 5. When selecting data from the database the driver should:
> > >    - return strings which have a unicode character set as UTF8.
> > >    - return strings with other character sets as-is (unchanged) on
> > >      the presumption that the application knows what to do with it.
> >
> > The problem is that there's no standard metadata currently defined
> > to provide what the encoding is, and (AFAIK) Perl only has a
> > "is_utf8()" method that can be tested on any string independently.
>
> It's a driver-private issue. If a column is unicode then (I think)
> the driver should set $sth->{TYPE}->[...] to the appropriate 'wide'
> type (SQL_WCHAR, SQL_WVARCHAR, etc).

Problem is, that's still a pretty high level. SQL_WCHAR can be UTF8, UTF16, UTF32,
UCS2, etc. And we don't have a standard way of indicating which it is at present, and
we've previously agreed that we're not going to force DBDs to normalize on anything.
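
To make the ambiguity concrete, here is a minimal sketch using the core Encode
module. The helper name `encoded_hex` is made up for illustration; it only shows
that one character string yields different bytes under each encoding a driver
might use for a column it reports as SQL_WCHAR.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of $text encoded as $enc.
sub encoded_hex {
    my ($enc, $text) = @_;
    return unpack("H*", encode($enc, $text));
}

my $text = "caf\x{e9}";   # 4 characters, the last one is U+00E9

# Same string, four different byte sequences.
for my $enc (qw(UTF-8 UTF-16LE UTF-16BE UTF-32LE)) {
    my $hex = encoded_hex($enc, $text);
    printf "%-9s %2d bytes: %s\n", $enc, length($hex) / 2, $hex;
}
```

Without some indication of which of these the driver chose, the app cannot
reliably decode the bytes it gets back.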

(I'm probably getting confused at this point, but bear with me...)

Are we relying on locale to determine which UNICODE encoding
the data is in those cases ?

E.g., my locale's charset is UTF8, and I retrieve some UNICODE columns. The DBD returns
UTF16, but doesn't have a std. means of telling the app that. Is the implicit
assumption that either a) the DBD, b) the client libs, or c) the database must figure
out how to locale-ize the results ?

For that matter, if my locale's charset is UTF8 and the DBD returns some latin3
columns, who is responsible for getting them into the locale's charset ? At present,
the app can't (except by using driver-specific interfaces, if they exist), since it
doesn't know that the columns are latin3, only that they're SQL_CHAR.
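
As a rough sketch of what the app could do if it *did* know the column's charset:
the function `to_locale_charset` and the idea of a per-column charset name are
assumptions for illustration only, not an existing DBI or DBD interface. The
conversion itself uses only Encode's `decode` and `encode`.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical: $column_charset is what a driver *would* report for a column
# if a standard attribute existed. undef means "driver doesn't know".
sub to_locale_charset {
    my ($bytes, $column_charset, $locale_charset) = @_;
    return $bytes unless defined $column_charset;   # app is on its own
    my $chars = decode($column_charset, $bytes);    # bytes -> Perl characters
    return encode($locale_charset, $chars);         # characters -> locale bytes
}

# In latin3 (ISO-8859-3) the byte 0xF8 is U+011D; in latin1 it would be U+00F8.
my $latin3_bytes = "\xF8";
my $utf8_bytes   = to_locale_charset($latin3_bytes, 'iso-8859-3', 'UTF-8');
print unpack("H*", $utf8_bytes), "\n";
```

The point is only that the conversion is trivial once the charset name is
available; the missing piece is the metadata.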

>
> > I'm not expecting every driver, or DBI, to normalize everything; rather, just
> > a piece of info to tell an app what encoding the data is in.
>
> It boils down to unicode or not. And if not then you (currently)
> have to assume that it's the same charset as the client because
> 99.8745% of the time it will be.

Which I assume is to be derived from the locale. OK.

>
> > Presumably,
> > just another bit of $sth metadata, e.g., $sth->{CHAR_SET}, to provide
> > the info. If the driver doesn't know, then it fills in with undef, and
> > the app is on its own. Otherwise, the app has enough info to make
> > the necessary conversion:
>
> You're presuming that all database that support charsets will use
> the same set of names as Encode uses. I hope that is the case but
> it might not be. (Add that to your list of things to discover :)

Or that DBDs will make the effort to map their database's
encoding names to their Perl equivalents, much as we assume
they know how to map whatever their client lib/database uses
to indicate "this is an integer" into SQL_INTEGER.
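
Something like the following is what I have in mind, as a sketch only. The
database-side names in the table are invented for illustration, and
`perl_charset_name` is a hypothetical helper; `Encode::find_encoding` is the
real Encode call, which returns undef for a name Encode doesn't recognize.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical driver-private table: database charset name => Encode name.
# The left-hand names are made up; a real driver would list its own.
my %DB_TO_PERL = (
    LATIN1_ISO => 'iso-8859-1',
    LATIN3_ISO => 'iso-8859-3',
    UNICODE_8  => 'UTF-8',
);

# Returns Encode's canonical name for the charset, or undef if unknown.
sub perl_charset_name {
    my ($db_name) = @_;
    my $key       = uc $db_name;
    my $candidate = exists $DB_TO_PERL{$key} ? $DB_TO_PERL{$key} : $db_name;
    my $enc       = Encode::find_encoding($candidate);
    return $enc ? $enc->name : undef;
}

for my $name (qw(LATIN3_ISO latin1 NO_SUCH_CHARSET)) {
    my $perl = perl_charset_name($name);
    printf "%-16s => %s\n", $name, defined $perl ? $perl : 'undef';
}
```

Names the table doesn't cover fall through to Encode's own alias handling, so
a driver only needs entries where the database's name differs from Perl's.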

>
> > As for parameter data, there might be an optional CHAR_SET attribute provided,
> > or perhaps each driver can specify its "preferred" encoding, and the app
> > can coerce its data into that encoding as needed.
>
> The "preferred" encoding could (should?) be unicode.

So the app specifies SQL_WCHAR...but should the DBD then assume it's
UTF8 encoded ? Couldn't it be UTF16, UCS2, etc. ? But there's no way
to tell the DBD that. And the DBD may need to use some other
encoding than UTF8.

Maybe it boils down to a set of rules:

1. DBDs should communicate the available locale information
to the datasource (which includes everything below the DBD:
client libs and database)

2. The datasource is responsible for returning data in the client's
locale defined charset.

3. The DBD assumes all data provided by an application is in
the locale defined charset.

4. If the datasource cannot return or accept data in the client's locale
defined charset, then ???

If we can nail down (4), even if it's just "throw an error on connection",
then I think we've got your 99.8745% covered.
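
A minimal sketch of the "throw an error on connection" option for (4), assuming
the driver can somehow get a list of charsets the datasource handles (how it
would do that is driver-specific and not shown). `check_locale_charset` is a
hypothetical name, not part of DBI.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical connect-time check. $supported is an array ref of charset
# names the datasource claims to handle. Names are compared via Encode's
# canonical names so that aliases such as "latin1" still match.
sub check_locale_charset {
    my ($locale_charset, $supported) = @_;
    my $want = Encode::find_encoding($locale_charset)
        or die "Unknown locale charset '$locale_charset'\n";
    for my $name (@$supported) {
        my $have = Encode::find_encoding($name) or next;
        return $have->name if $have->name eq $want->name;
    }
    die "Datasource cannot use locale charset '$locale_charset'\n";
}

print check_locale_charset('latin1', [ 'UTF-8', 'ISO-8859-1' ]), "\n";

eval { check_locale_charset('UTF-8', ['iso-8859-3']) };
print "connect would fail: $@" if $@;
```

Where the platform supports it, the locale's charset could come from
`langinfo(CODESET)` in the core I18N::Langinfo module; that is left out of the
sketch because it isn't available everywhere.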

BTW: I did some checking on setting/getting locale info
(via setlocale(LC_CTYPE)), and AS Perl on Win32 doesn't appear
to behave very well. It reports something, but won't permit
a modification. I tried AS 5.6 on WinXP and AS 5.8.3 on Win2K.
Does anyone know how to make that work (Google wasn't too helpful)?
Fedora 1 w/ Perl 5.8.3 works fine. I don't know if that impacts our
discussion or not, but I'd hate to settle on a solution that precluded
support for the most common platform.
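
For reference, the kind of check I mean looks roughly like this. `try_ctype` is
a made-up helper name; `POSIX::setlocale` returns the resulting locale name, or
undef if the change was refused.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: query LC_CTYPE, try to change it, report both.
sub try_ctype {
    my ($wanted) = @_;
    my $before = setlocale(LC_CTYPE);            # query only, no change
    my $after  = setlocale(LC_CTYPE, $wanted);   # attempt the change
    return { before => $before, after => $after, ok => defined $after ? 1 : 0 };
}

my $r = try_ctype("C");   # the "C" locale should be available everywhere
printf "before=%s after=%s ok=%d\n",
    defined $r->{before} ? $r->{before} : 'undef',
    defined $r->{after}  ? $r->{after}  : 'undef',
    $r->{ok};
```

Passing a locale name the platform doesn't recognize should come back with
`ok` set to 0, which is the behavior to compare across platforms.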

I've ranted and braindumped enough for one day, so flame away,
Dean Arnold
Presicient Corp.
www.presicient.com
