On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:
> On Thu, Apr 29, 2004 at 10:42:18PM -0400, Lincoln A. Baxter wrote:
> > On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
> > > Am I right in thinking that perl's internal utf8 representation
> > > represents surrogates as a single (4 byte) code point and not as
> > > two separate code points?
> > > 
> > > This is the form that Oracle call AL32UTF8.
> > > 
> > > What would be the effect of setting SvUTF8_on(sv) on a valid utf8
> > > byte string that used surrogates? Would there be problems?
> > > (For example, a string returned from Oracle when using the UTF8
> > > character set instead of the newer AL32UTF8 one.)
> >
> > I think it makes no difference (at least I could not find one), except
> > for the internal storage.  Several of the tests I wrote print a sql
> > DUMP(nch), and you can see the difference in the internal store in those
> > prints.  The strings come back to the client, the way they were put in.
> > 
> > I have tested this with 4 databases
> > 
> > dbcharset/ncharset
> > --------- --------
> > us7ascii/utf8
> > us7ascii/al16utf16
> > utf8    /utf8
> > utf8    /al16utf16
> > 
> > All tests produce the same results with all databases using both .UTF8
> > and .AL32UTF8 in NLS_LANG.
> 
> Were you using characters that require surrogates in UTF16?
> If not then you wouldn't see a difference between .UTF8 and .AL32UTF8.

Hmmm...err.. probably not... I guess I need to hunt one up.
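
For reference, any code point above U+FFFF needs a surrogate pair in
UTF16, so something like U+10400 should do as a test character. Here is a
rough sketch (in Python, purely illustrative and not part of the
DBD::Oracle tests; the function names are made up) of the two byte
layouts I would expect: the 4 byte form that Oracle calls AL32UTF8, and
the 6 byte form where each surrogate is encoded separately as 3 bytes,
which is what I assume the older Oracle UTF8 character set produces.

```python
def encode_four_byte(ch: str) -> bytes:
    """Standard UTF-8: one supplementary character becomes 4 bytes."""
    return ch.encode("utf-8")


def encode_surrogate_pair(ch: str) -> bytes:
    """Encode each UTF-16 surrogate separately as a 3-byte sequence.

    Assumption: this mirrors the 6 byte layout described for Oracle's
    older UTF8 character set. Characters at or below U+FFFF are
    encoded as ordinary UTF-8.
    """
    cp = ord(ch)
    if cp < 0x10000:
        return ch.encode("utf-8")
    offset = cp - 0x10000
    high = 0xD800 | (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 | (offset & 0x3FF)   # low (trailing) surrogate
    # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate.
    return b"".join(
        chr(s).encode("utf-8", "surrogatepass") for s in (high, low)
    )


test_char = "\U00010400"  # a supplementary character (outside the BMP)
four = encode_four_byte(test_char)
six = encode_surrogate_pair(test_char)
print(four.hex(" "), len(four))
print(six.hex(" "), len(six))
```

Inserting a character like this and comparing the DUMP() output under
.UTF8 and .AL32UTF8 should show whether the two settings really differ.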

> Here's a relevant quote from the Oracle 9.2 docs at
> http://www.dbis.informatik.uni-goettingen.de/Teaching/oracle-doc/server.920/a96529/ch6.htm#1005295
> 
> "You can use UTF8 and AL32UTF8 by setting NLS_LANG for OCI client
> applications. If you do not need supplementary characters, then it
> does not matter whether you choose UTF8 or AL32UTF8. However, if
> your OCI applications might handle supplementary characters, then
> you need to make a decision. Because UTF8 can require up to three
> bytes for each character, one supplementary character is represented
> in two code points, totalling six bytes. In AL32UTF8, one supplementary
> character is represented in one code point, totalling four bytes."
> 
> So the key question is... can we just do SvUTF8_on(sv) on either
> of these kinds of Oracle UTF8 encodings? Seems like the answer is
> yes, from what Jarkko says, because they are both valid UTF8.
> We just need to document the issue.
> 

Seems reasonable.  I think you made a good point about the cost of
crawling through the data. I'm convinced. If you have not already
changed it, I will. 

> p.s. If we do opt for defaulting NLS_NCHAR (effectively) if NLS_LANG
> and NLS_NCHAR are not defined then we should use AL32UTF8 if possible.

I changed that last night (to use AL32UTF8).
