On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:
> "You can use UTF8 and AL32UTF8 by setting NLS_LANG for OCI client
> applications. If you do not need supplementary characters, then it
> does not matter whether you choose UTF8 or AL32UTF8. However, if
> your OCI applications might handle supplementary characters, then
> you need to make a decision. Because UTF8 can require up to three
> bytes for each character, one supplementary character is represented
> in two code points, totalling six bytes. In AL32UTF8, one supplementary
> character is represented in one code point, totalling four bytes."
>
> So the key question is... can we just do SvUTF8_on(sv) on either
> of these kinds of Oracle UTF8 encodings? Seems like the answer is
> yes, from what Jarkko says, because they are both valid UTF8.
> We just need to document the issue.

No, Oracle's "UTF8" is very much not valid UTF-8. Valid UTF-8 cannot
contain surrogates. If you mark a string like this as UTF-8, neither
the Perl core nor other extension modules will be able to interpret it
correctly.

(As people have pointed out earlier in the thread, if you want a
standard name for this weird form of encoding, that's "CESU-8":
http://www.unicode.org/reports/tr26/.)

You'll need to do a conversion pass before you can mark it as UTF-8.

Regards,
Owen
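
For illustration, the conversion pass mentioned above might look roughly
like the C sketch below. The name cesu8_to_utf8 is hypothetical; it is not
part of Perl or DBD::Oracle. The sketch assumes the input is well formed
apart from surrogate pairs, and it validates nothing else: each six-byte
encoded surrogate pair is rewritten as one four-byte UTF-8 sequence, and
every other byte (including any unpaired surrogate) is copied unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rewrite CESU-8 surrogate pairs as 4-byte UTF-8.
 * dst must have room for at least len bytes; the output is never longer
 * than the input. Returns the number of bytes written to dst. */
size_t cesu8_to_utf8(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t i = 0, o = 0;

    assert(src != NULL && dst != NULL);

    while (i < len) {
        /* High surrogate: ED A0..AF 80..BF; low surrogate: ED B0..BF 80..BF */
        if (len - i >= 6
            && src[i]     == 0xED && (src[i + 1] & 0xF0) == 0xA0
            && (src[i + 2] & 0xC0) == 0x80
            && src[i + 3] == 0xED && (src[i + 4] & 0xF0) == 0xB0
            && (src[i + 5] & 0xC0) == 0x80) {
            /* Recover the two UTF-16 code units from their 3-byte forms. */
            unsigned long hi = 0xD800UL
                | ((unsigned long)(src[i + 1] & 0x0F) << 6)
                | (unsigned long)(src[i + 2] & 0x3F);
            unsigned long lo = 0xDC00UL
                | ((unsigned long)(src[i + 4] & 0x0F) << 6)
                | (unsigned long)(src[i + 5] & 0x3F);
            /* Combine them into one supplementary code point. */
            unsigned long cp = 0x10000UL + ((hi - 0xD800UL) << 10)
                                         + (lo - 0xDC00UL);

            /* Emit the code point as a standard 4-byte UTF-8 sequence. */
            dst[o++] = (unsigned char)(0xF0 | (cp >> 18));
            dst[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            dst[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[o++] = (unsigned char)(0x80 | (cp & 0x3F));
            i += 6;
        } else {
            /* Anything else is copied through byte for byte. */
            dst[o++] = src[i++];
        }
    }
    return o;
}
```

Since the result can only shrink, an output buffer the same size as the
input is enough. In a driver, a pass like this would have to run on the
fetched bytes before SvUTF8_on(sv) is applied; a real implementation would
probably also want to reject or replace unpaired surrogates instead of
passing them through.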