Tim Bunce wrote:
> On Sat, Mar 20, 2004 at 03:05:47PM -0500, Lincoln A. Baxter wrote:
> 
>> Of course, there is still the problem of getting dbd-oracle 
>> to actually work with nchar data.  For me its back to that problem...
> 
> I was thinking of ora_can_unicode() primarily in terms of controlling
> the unicode tests. Which ones to run (NCHAR and/or CHAR) and 
> which to skip.
> 
> I see no need to worry about alter session statements. Anyone changing
> the character sets on the fly ought to know what they're doing anyway.

 Fortunately you can't (currently) change the client character set in
mid-session. There's an OCI call to do it, but no 'alter session' as far as I
can tell.

 The DBA can change the database character set with an 'alter system', but in
practice that's never going to happen.

> I'm also not too worried about the client-side character set. I
> figure we should ask for anything that's unicode on the server-side
> to be given to us as unicode and let perl deal with converting the
> unicode to whatever encoding the application is using.

 Shouldn't this be the other way around, at least for DBD::Oracle? It's
_all_ about the client character set.

 As far as fetching goes, it doesn't matter what the server character set is,
so long as your client character set is equal to or a superset of it. Unicode
is a superset of pretty much everything, so having your NLS_LANG set to .UTF8
means you won't lose data.
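 The fetch direction can be demonstrated with plain Encode, no database
needed (a sketch - the sample data is made up, and Encode's charset names
stand in for Oracle's WE8ISO8859P15 etc.):

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_QUIET);

# Anything in a Latin character set recodes to UTF-8 without loss,
# which is why fetching with NLS_LANG=.UTF8 is safe...
my $str  = decode('iso-8859-15', "caf\xE9 \xA4");  # "café €", say
my $utf8 = encode('UTF-8', $str);                  # always succeeds

# ...whereas recoding the other way can drop characters: Latin-1 has
# no Euro sign, so that conversion would be lossy.
my $rest = $str;
encode('iso-8859-1', $rest, FB_QUIET);   # FB_QUIET leaves the unconvertible tail
print length($rest) ? "lossy\n" : "lossless\n";    # prints "lossy"
```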

 For binding, you should make sure you only bind characters in the
intersection of the client and database character sets, or the conversion
will be lossy; but the binding is always done in the client character set (at
least at the moment - see the next bit).

 This is what I think 'UTF-8 support in DBD::Oracle' would amount to - am I
on the right track here?

===

1. If your NLS_LANG's character set is anything other than UTF8:
        1a. All data fetched is sent to Perl unaltered, exactly as fetched
by OCI, in the client character set (it may have been recoded by Oracle, but
that's transparent).
        1b. If a Perl string with the utf8 flag set is bound to a statement,
it is bound as UTF8 rather than the client character set. Otherwise it is
bound as normal (in the client character set).

2. If your NLS_LANG is set to .UTF8:
        2a. All data fetched comes back with the Perl utf8 flag set, as it
is known to be valid UTF8 since Oracle converts it if necessary (it may have
originally been Unicode on the server, but that's transparent from the
client side).
        2b. All data bound is bound as UTF8, whether it has the Perl utf8
flag or not.

===
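 The bind-side decision in 1b/2b could be sketched as pure Perl - the
charset_for_bind helper is hypothetical, standing in for what DBD::Oracle's
C bind code would actually do, and 'client_charset' stands in for the
character set part of NLS_LANG:

```perl
use strict;
use warnings;

# Decide which character set a value should be bound as, per the
# scheme above (a sketch, not DBD::Oracle's real code).
sub charset_for_bind {
    my ($value, $client_charset) = @_;
    # 2b: a UTF8 client always binds as UTF8.
    return 'UTF8' if $client_charset eq 'UTF8';
    # 1b: a utf8-flagged Perl string is bound as UTF8 regardless.
    return 'UTF8' if utf8::is_utf8($value);
    # 1a/otherwise: bind in the client character set.
    return $client_charset;
}

my $plain = "abc";
my $wide  = "\x{20ac}";   # utf8-flagged Euro sign
print charset_for_bind($plain, 'WE8ISO8859P1'), "\n";  # WE8ISO8859P1
print charset_for_bind($wide,  'WE8ISO8859P1'), "\n";  # UTF8
```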

 (The national character set only affects the above if you have NLS_NCHAR
set, i.e. your client national character set differs from your main client
character set.)

 Apart from 1b, DBD::Oracle appears to be doing most of this already, at
least in the last svn revision I tried. 1b /should/ just be a matter of
setting OCI_ATTR_CHARSET_ID appropriately on the bind handle when the Perl
utf8 flag is set.

 I think this ties in with Tim's points on the other character set thread:

> 3. I don't really want the DBI to be involved in any recoding
>    of character sets (from client charset to server charset)
>    and I suggest that the drivers don't try to do that either.

> 5. When selecting data from the database the driver should:
>    - return strings which have a unicode character set as UTF8.
>    - return strings with other character sets as-is (unchanged) on
>      the presumption that the application knows what to do with it.

 Sounds right to me. I don't think it should be trying to turn other Unicode
encodings into UTF-8, as I think I read in one of the other mails; if you
have NLS_LANG set to .UTF16 (or whatever the full code is), then you should
get UTF-16 strings without the Perl utf8 flag set. Only NLS_LANG=.UTF8
should result in utf8-flagged Perl strings being returned, as that's the
only encoding Perl's internals really support.
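 One way to see why, without a database: Perl's internal wide-string
representation is UTF-8-based, so only UTF-8 bytes could honestly carry the
utf8 flag - the UTF-16 bytes for the same character are a different byte
sequence entirely:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $euro = "\x{20ac}";
# The same character encodes to different byte sequences; flipping the
# utf8 flag on UTF-16 bytes would misinterpret them.
my $utf8  = encode('UTF-8',    $euro);   # "\xE2\x82\xAC" - 3 bytes
my $utf16 = encode('UTF-16BE', $euro);   # "\x20\xAC"     - 2 bytes
printf "UTF-8: %d bytes, UTF-16BE: %d bytes\n",
       length $utf8, length $utf16;
# prints "UTF-8: 3 bytes, UTF-16BE: 2 bytes"
```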

> 8. When passing data to the database (including the SQL statement)
>    the driver should (perhaps) warn if it's presented with UTF8
>    strings but the database or database can't handle unicode.

 Whether Oracle can handle being sent Unicode is not an all-or-nothing
thing; this is where it depends on the database character set, and so back
to ora_can_unicode.

 The question that ora_can_unicode answers is "Can I send ANY Unicode
character and be confident it will be stored without corruption?". The
original test failure was because it was trying to store a character that
wasn't representable in the target database.

 A more refined question (clearly optional for utf8 support, but it seems a
useful support function) would be:

 "Given the current client character set, and whether the utf8 flag is set
on the Perl string, can I store this value without data loss in the database
character set, the national character set, or both?"

 $dbh->ora_can_store_string($string) perhaps? Bitmask return value as per
ora_can_unicode?

 There are OCI functions that can answer this, e.g. OCINlsCharSetConvert()
followed by OCICharSetConversionIsReplacementUsed().

 The Euro symbol is a good example for this question, since it's either not
present, or in completely different places in the most popular character
sets.

 e.g. binding "\x{20ac}" should be fine so long as your database is in UTF8,
one of the other Unicode sets, WE8ISO8859P15 or WE8MSWIN1252 - but not if
it's WE8ISO8859P1 (Latin-1), which has no Euro symbol.

 If you try to bind its single-byte equivalent, chr(128) or chr(164), it
depends on your client character set as well as the database character set.
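 A client-side approximation of that check can be sketched with Encode's
FB_QUIET fallback (the helper name is hypothetical; the real implementation
would ask OCI via the conversion functions above):

```perl
use strict;
use warnings;
use Encode qw(encode FB_QUIET);

# Rough stand-in for an ora_can_store_string-style check, done
# client-side with Encode rather than OCINlsCharSetConvert.
sub can_store_losslessly {
    my ($string, $charset) = @_;
    my $rest = $string;
    encode($charset, $rest, FB_QUIET);  # consumes what it can convert
    return length($rest) == 0;          # leftover = unconvertible characters
}

my $euro = "\x{20ac}";
print can_store_losslessly($euro, 'iso-8859-15') ? "ok" : "lossy", "\n"; # ok
print can_store_losslessly($euro, 'iso-8859-1')  ? "ok" : "lossy", "\n"; # lossy
```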

 Hope I'm making some sense :-)

-- 
Andy Hassall <[EMAIL PROTECTED]> / Space: disk usage analysis tool
<http://www.andyh.co.uk> / <http://www.andyhsoftware.co.uk/space> 
