On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:
> On Thu, Apr 29, 2004 at 10:42:18PM -0400, Lincoln A. Baxter wrote:
> > On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
> > > Am I right in thinking that perl's internal utf8 representation
> > > represents surrogates as a single (4 byte) code point and not as
> > > two separate code points?
> > >
> > > This is the form that Oracle call AL32UTF8.
> > >
> > > What would be the effect of setting SvUTF8_on(sv) on a valid utf8
> > > byte string that used surrogates? Would there be problems?
> > > (For example, a string returned from Oracle when using the UTF8
> > > character set instead of the newer AL32UTF8 one.)
> >
> > I think it makes no difference (at least I could not find one), except
> > for the internal storage. Several of the tests I wrote print a sql
> > DUMP(nch), and you can see the difference in the internal store in those
> > prints. The strings come back to the client the way they were put in.
> >
> > I have tested this with 4 databases
> >
> > dbcharset/ncharset
> > --------- --------
> > us7ascii /utf8
> > us7ascii /al16utf16
> > utf8     /utf8
> > utf8     /al16utf16
> >
> > All tests produce the same results with all databases using both .UTF8
> > and .AL32UTF8 in NLS_LANG.
>
> Were you using characters that require surrogates in UTF16?
> If not then you wouldn't see a difference between .UTF8 and .AL32UTF8.

Hmmm...err.. probably not... I guess I need to hunt one up.

> Here's a relevant quote from the Oracle 9.2 docs at
> http://www.dbis.informatik.uni-goettingen.de/Teaching/oracle-doc/server.920/a96529/ch6.htm#1005295
>
> "You can use UTF8 and AL32UTF8 by setting NLS_LANG for OCI client
> applications. If you do not need supplementary characters, then it
> does not matter whether you choose UTF8 or AL32UTF8. However, if
> your OCI applications might handle supplementary characters, then
> you need to make a decision. Because UTF8 can require up to three
> bytes for each character, one supplementary character is represented
> in two code points, totalling six bytes. In AL32UTF8, one supplementary
> character is represented in one code point, totalling four bytes."
>
> So the key question is... can we just do SvUTF8_on(sv) on either
> of these kinds of Oracle UTF8 encodings? Seems like the answer is
> yes, from what Jarkko says, because they are both valid UTF8.
> We just need to document the issue.

Seems reasonable. I think you made a good point about the cost of
crawling through the data. I'm convinced. If you have not already
changed it, I will.

> p.s. If we do opt for defaulting NLS_NCHAR (effectively) if NLS_LANG
> and NLS_NCHAR are not defined then we should use AL32UTF8 if possible.

I changed that last night (to use AL32UTF8).
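
For anyone hunting up a test character: the six-versus-four byte difference
described in the Oracle quote can be reproduced outside the database. The
following is only a rough sketch in Python, not anything taken from
DBD::Oracle or OCI. The function names are made up for illustration, U+10400
is an arbitrary supplementary character, and the decoding behaviour shown is
that of Python's codecs, which says nothing definite about what perl does
after SvUTF8_on(sv).

```python
def utf8_bytes(ch: str) -> bytes:
    """Standard UTF-8: a supplementary character is one 4-byte sequence."""
    return ch.encode("utf-8")


def surrogate_pair_bytes(ch: str) -> bytes:
    """Encode each UTF-16 code unit of ch as its own UTF-8-style sequence.

    For a supplementary character this gives two 3-byte sequences (one per
    surrogate), which matches the six-byte form the Oracle docs describe
    for the older UTF8 character set. Hypothetical helper, for illustration.
    """
    utf16 = ch.encode("utf-16-be")
    units = [int.from_bytes(utf16[i:i + 2], "big")
             for i in range(0, len(utf16), 2)]
    # "surrogatepass" lets Python emit bytes for lone surrogate code points.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


ch = "\U00010400"  # arbitrary character outside the BMP (assumption)

four = utf8_bytes(ch)
six = surrogate_pair_bytes(ch)

print(len(four), four.hex())  # 4 f0909080
print(len(six), six.hex())    # 6 eda081edb080

# Python's strict UTF-8 decoder rejects the six-byte form; a lax decode
# yields two separate surrogate code points rather than one character.
try:
    six.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

lax = six.decode("utf-8", "surrogatepass")
print(strict_ok, len(lax))    # False 2
```

A test string containing such a character, inserted and fetched under both
.UTF8 and .AL32UTF8, should make any difference between the two settings
visible in the DUMP() output.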