Re: DBI and character sets (yet again)

Honza Pazdziora Mon, 22 Mar 2004 00:50:20 -0800

On Sun, Mar 21, 2004 at 01:10:27PM -0800, Dean Arnold wrote:
> 2. The charset used for client<->server transfer syntax
> may not be the same as the internal storage charset
> of the DBMS (e.g., UTF8 may be used for transfer,
> but the target columns may be Latin1; attempting
> to insert a non-Latin1 compatible UTF8 character
> results in a DBMS error).


... or a silent conversion, either via transliteration or to some
substitute character.
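
To illustrate the difference between an error and a silent substitution, here is a minimal sketch using Perl's core Encode module on the client side. It only mimics what a DBMS might do on a Latin1 column; the sample string and variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

# U+20AC EURO SIGN has no representation in ISO-8859-1 (Latin1).
my $text = "price: \x{20AC}5";

# Default CHECK: the unmappable character is silently replaced by the
# substitution character of the target encoding.
my $lossy = encode('ISO-8859-1', $text);

# Strict CHECK: the same conversion dies instead of substituting.
# LEAVE_SRC keeps $text from being modified in place.
my $strict = eval { encode('ISO-8859-1', $text, FB_CROAK | LEAVE_SRC) };
my $error  = $@;

print "lossy:  $lossy\n";
print "strict: ", (defined $strict ? "$strict\n" : "failed: $error");
```

Which of the two behaviours a given server shows depends on the DBMS and its configuration, so it is worth testing with a character you know the column cannot hold.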

> Assume I'm transferring data from DBMS A in Latin1
> to DBMS B in UTF8. Are there any guarantees that
> DBMS A will output its returned character data in UTF8 ?
> Or, going in the other direction, that DBMS A will
> recognize that the parameter data supplied from DBMS B
> is in UTF8, and will make the necessary conversion to Latin1
> before transfer ?

I believe that you always have to tell the database server which
encoding your client uses for its data and ask the server to do the
conversion for you: via an environment variable (NLS_LANG for Oracle),
a command (SET CLIENT_ENCODING in PostgreSQL, SET CHARACTER SET in
MySQL), or other means.
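
As a rough sketch of how a script could pick the right mechanism per driver: the helper name client_encoding_sql is hypothetical, the NLS_LANG value is only an example, and the exact statements should be checked against the documentation of the server version in use.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the SQL statement that announces the
# client encoding to the server, or undef when the driver is configured
# through the environment instead.
sub client_encoding_sql {
    my ($driver) = @_;
    my %sql = (
        Pg    => "SET CLIENT_ENCODING TO 'UTF8'",
        mysql => "SET CHARACTER SET utf8",
    );
    return $sql{$driver};
}

# Oracle reads NLS_LANG from the environment, so it has to be set
# before the connection is opened. The value here is an example only.
$ENV{NLS_LANG} = 'AMERICAN_AMERICA.AL32UTF8';

for my $driver (qw(Pg mysql Oracle)) {
    my $sql = client_encoding_sql($driver);
    print "$driver: ",
        (defined $sql ? $sql : "uses NLS_LANG=$ENV{NLS_LANG}"), "\n";
}
```

With a connected handle, the returned statement would be passed to $dbh->do($sql) right after connecting, whenever it is defined.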

> Since there doesn't appear to be a way to "tag" a given
> Perl string with its charset encoding (other than UTF8),
> it would appear that normalization on UTF8 would be
> required, or, alternately, some add'l column/parameter
> metadata is needed to define the encoding, which (to my
> knowledge) does not yet exist in std. DBI metadata. 

Do you mean on input or on output? As far as I know, DBD::Oracle and
DBD::Pg (with pg_enable_utf8) already mark the strings they return
as UTF-8 Perl strings (and DBI->trace will show them with double
quotes). So the internal Perl UTF-8 flag is the mechanism you are
looking for.
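
A small sketch of what that flag means in practice, using only core Perl. The byte string stands in for data fetched from a driver that does not decode for you; it is an assumption for the example, not output from a real database.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of U+00E9 (e with acute) as two raw bytes, the way
# a driver that does not decode would hand it over.
my $bytes = "\xC3\xA9";

# Decoding turns the two bytes into one character and sets the flag.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d, flag %s\n",
    length($bytes), (utf8::is_utf8($bytes) ? 'on' : 'off');
printf "chars: length %d, flag %s\n",
    length($chars), (utf8::is_utf8($chars) ? 'on' : 'off');
```

A driver that marks its results as UTF-8 hands back strings like $chars, so length, substr and regular expressions work on characters rather than on bytes.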

> The DBI POD makes reference to using the defined locale...does
> that mean drivers should use $ENV{LANGUAGE} or $ENV{LANG}
> or $ENV{LC_ALL} or $ENV{LC_TYPE} to determine the "normalized"
> charset, and fallback to UTF8 if none of the above are defined ?

Since you cannot rely on native string handling in Perl for anything
other than US-ASCII and UTF-8 (for example, uc $string will not work
if $string holds bytes encoded in ISO-8859-3), I believe that using
UTF-8 is the way to go, with all other encodings yielding undocumented
results ...
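
The uc example can be sketched like this with the core Encode module. Byte 0xB1 is the small letter h with stroke in ISO-8859-3 and 0xA1 is its capital form; the variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# 0xB1 is LATIN SMALL LETTER H WITH STROKE in ISO-8859-3;
# its capital form is 0xA1.
my $bytes = "\xB1";

# On the raw byte string, uc() does not know the byte is a letter
# in ISO-8859-3 and leaves it alone.
my $naive = uc($bytes);

# Decode to Perl characters, uppercase, then encode back.
my $upper = encode('ISO-8859-3', uc(decode('ISO-8859-3', $bytes)));

printf "naive:          0x%02X\n", ord($naive);
printf "via characters: 0x%02X\n", ord($upper);
```

The same decode, operate, encode pattern applies to lc, regular expression character classes and sorting, which is why keeping everything as UTF-8 character strings inside the program is the simpler route.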

-- 
------------------------------------------------------------------------
 Honza Pazdziora | [EMAIL PROTECTED] | http://www.fi.muni.cz/~adelton/
 .project: Perl, mod_perl, DBI, Oracle, large Web systems, XML/XSL, ...
                Only self-confident people can be simple.
