On Thu, Apr 29, 2004 at 10:42:18PM -0400, Lincoln A. Baxter wrote:
> On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
> > Am I right in thinking that perl's internal utf8 representation
> > represents surrogates as a single (4 byte) code point and not as
> > two separate code points?
> >
> > This is the form that Oracle call AL32UTF8.
> >
> > What would be the effect of setting SvUTF8_on(sv) on a valid utf8
> > byte string that used surrogates? Would there be problems?
> > (For example, a string returned from Oracle when using the UTF8
> > character set instead of the newer AL32UTF8 one.)
>
> I think it makes no difference (at least I could not find one), except
> for the internal storage. Several of the tests I wrote print a SQL
> DUMP(nch), and you can see the difference in the internal storage in
> those prints. The strings come back to the client the way they were
> put in.
>
> I have tested this with 4 databases:
>
> dbcharset/ncharset
> --------- --------
> us7ascii /utf8
> us7ascii /al16utf16
> utf8     /utf8
> utf8     /al16utf16
>
> All tests produce the same results with all databases using both .UTF8
> and .AL32UTF8 in NLS_LANG.

Were you using characters that require surrogates in UTF16? If not then
you wouldn't see a difference between .UTF8 and .AL32UTF8.

Here's a relevant quote from the Oracle 9.2 docs at
http://www.dbis.informatik.uni-goettingen.de/Teaching/oracle-doc/server.920/a96529/ch6.htm#1005295

  "You can use UTF8 and AL32UTF8 by setting NLS_LANG for OCI client
  applications. If you do not need supplementary characters, then it
  does not matter whether you choose UTF8 or AL32UTF8. However, if
  your OCI applications might handle supplementary characters, then
  you need to make a decision. Because UTF8 can require up to three
  bytes for each character, one supplementary character is represented
  in two code points, totalling six bytes. In AL32UTF8, one
  supplementary character is represented in one code point, totalling
  four bytes."

So the key question is... can we just do SvUTF8_on(sv) on either of
these kinds of Oracle UTF8 encodings? Seems like the answer is yes,
from what Jarkko says, because they are both valid UTF8.
We just need to document the issue.

Tim.

p.s. If we do opt for defaulting NLS_NCHAR (effectively) when NLS_LANG
and NLS_NCHAR are not defined, then we should use AL32UTF8 if possible.
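
To make the byte counts in the Oracle quote concrete, here is a rough Python sketch of the two encodings for one supplementary character. It assumes that Oracle's older UTF8 character set behaves as the quote describes (each UTF-16 surrogate encoded separately as a 3-byte sequence) and that AL32UTF8 matches standard UTF-8. The function names are made up for illustration, and nothing here has been run against an Oracle server.

```python
def encode_al32utf8(text: str) -> bytes:
    """Standard UTF-8: one supplementary character -> one 4-byte sequence."""
    return text.encode("utf-8")


def encode_oracle_utf8(text: str) -> bytes:
    """Assumed behaviour of Oracle's older UTF8 charset: every UTF-16 code
    unit is encoded on its own, so a surrogate pair becomes 3 + 3 bytes."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


if __name__ == "__main__":
    ch = "\U00010400"  # a character outside the BMP, needs surrogates in UTF-16
    print(encode_al32utf8(ch).hex(" "))     # 4 bytes
    print(encode_oracle_utf8(ch).hex(" "))  # 6 bytes
```

For characters inside the BMP both functions return identical bytes, which would explain why tests that use no supplementary characters show no difference between .UTF8 and .AL32UTF8.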