On Sat, May 01, 2004 at 05:35:58PM -0400, Lincoln A. Baxter wrote:
> Hello Owen,
>
> On Sat, 2004-05-01 at 16:46, Owen Taylor wrote:
> > On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:
> >
> > > "You can use UTF8 and AL32UTF8 by setting NLS_LANG for OCI client
> > > applications. If you do not need supplementary characters, then it
> > > does not matter whether you choose UTF8 or AL32UTF8. However, if
> > > your OCI applications might handle supplementary characters, then
> > > you need to make a decision. Because UTF8 can require up to three
> > > bytes for each character, one supplementary character is represented
> > > in two code points, totalling six bytes. In AL32UTF8, one supplementary
> > > character is represented in one code point, totalling four bytes."
> > >
> > > So the key question is... can we just do SvUTF8_on(sv) on either
> > > of these kinds of Oracle UTF8 encodings? Seems like the answer is
> > > yes, from what Jarkko says, because they are both valid UTF8.
> > > We just need to document the issue.
> >
> > No, Oracle's "UTF8" is very much not valid UTF-8. Valid UTF-8 cannot
> > contain surrogates. If you mark a string like this as UTF-8 neither
> > the Perl core nor other extension modules will be able to interpret
> > it correctly.
> >
> > (As people have pointed out earlier in the thread,
> > if you want a standard name for this weird form of encoding, that's
> > "CESU": http://www.unicode.org/reports/tr26/.)
> >
> > You'll need to do a conversion pass before you can mark it as UTF-8.
>
> Your message comes at a PERFECT time!
>
> I just spent about 3 hours coming to that same conclusion empirically:
>
> I made the changes to do what Tim had asked (just mark the string
> as UTF8), and it breaks a bunch of stuff, like the 8bit nchar test,
> and the long test when column type is LONG.
>
> I think I am going to back out (or rather... NOT COMMIT) those changes,
> leaving the code that inspects the fetched string to see if it ("looks
> like") utf8 before setting the flag.

I think we should always mark "Oracle UTF8" strings as "Perl UTF8".

Basically "Oracle UTF8" is broken for non-BMP characters. Period.
So no one should be using the "Oracle UTF8" character set for them.
It just needs a note in the docs.

Tim.
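
For the docs note, the byte counts in the Oracle text quoted above can be checked with a small sketch. It is in Python purely for illustration and is not DBD::Oracle code. The helper names cesu8_encode and cesu8_to_utf8 are hypothetical, and it assumes Oracle's "UTF8" behaves like CESU-8 as described in the thread (each UTF-16 surrogate encoded as its own three-byte sequence).

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode text CESU-8 style, one UTF-16 code unit
    at a time, so a supplementary character becomes two 3-byte sequences."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form of a surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def cesu8_to_utf8(data: bytes) -> bytes:
    """Hypothetical conversion pass: turn CESU-8 bytes into valid UTF-8."""
    # Decode while keeping each surrogate as a separate code point ...
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # ... then round-trip through UTF-16 so each surrogate pair is joined.
    # An unpaired surrogate makes the final decode raise UnicodeDecodeError.
    joined = with_surrogates.encode("utf-16-be", "surrogatepass").decode("utf-16-be")
    return joined.encode("utf-8")


ch = "\U00010400"                      # a non-BMP (supplementary) character
cesu = cesu8_encode(ch)
print(len(ch.encode("utf-8")), len(cesu))   # 4 6

try:
    cesu.decode("utf-8")               # strict UTF-8 rejects surrogates
except UnicodeDecodeError:
    print("not valid UTF-8 until converted")

print(cesu8_to_utf8(cesu) == ch.encode("utf-8"))   # True
```

For BMP-only data the two encodings produce identical bytes, so the difference only shows up once supplementary characters are stored.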