On Sat, May 01, 2004 at 05:35:58PM -0400, Lincoln A. Baxter wrote:
> Hello Owen, 
> 
> On Sat, 2004-05-01 at 16:46, Owen Taylor wrote:
> > On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:
> > 
> > > "You can use UTF8 and AL32UTF8 by setting NLS_LANG for OCI client
> > > applications. If you do not need supplementary characters, then it
> > > does not matter whether you choose UTF8 or AL32UTF8. However, if
> > > your OCI applications might handle supplementary characters, then
> > > you need to make a decision. Because UTF8 can require up to three
> > > bytes for each character, one supplementary character is represented
> > > in two code points, totalling six bytes. In AL32UTF8, one supplementary
> > > character is represented in one code point, totalling four bytes."
> > > 
> > > So the key question is... can we just do SvUTF8_on(sv) on either
> > > of these kinds of Oracle UTF8 encodings? Seems like the answer is
> > > yes, from what Jarkko says, because they are both valid UTF8.
> > > We just need to document the issue.
> > 
> > No, Oracle's "UTF8" is very much not valid UTF-8. Valid UTF-8 cannot
> > contain surrogates. If you mark a string like this as UTF-8 neither
> > the Perl core nor other extension modules will be able to interpret
> > it correctly.
> > 
> > (As people have pointed out earlier in the thread,
> > if you want a standard name for this weird form of encoding, that's
> > "CESU-8": http://www.unicode.org/reports/tr26/.)
> > 
> > You'll need to do a conversion pass before you can mark it as UTF-8.
> 
> Your message comes at a PERFECT time!
> 
> I just spent about 3 hours coming to that same conclusion empirically:
> 
> I made the changes to do what Tim had asked (just mark the string
> as UTF8), and it breaks a bunch of stuff, like the 8bit nchar test,
> and the long test when column type is LONG.
> 
> I think I am going to back out (or rather... NOT COMMIT) those changes,
> leaving the code that inspects the fetched string to see if it "looks
> like" utf8 before setting the flag.

I think we should always mark "Oracle UTF8" strings as "Perl UTF8".

Basically "Oracle UTF8" is broken for non-BMP characters. Period.
So no one should be using the "Oracle UTF8" character set for them.
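
For anyone who wants to see the non-BMP difference concretely, here is a
rough sketch of the kind of conversion pass Owen describes. It is only an
illustration: cesu8_to_utf8 is a hypothetical helper name, not something
in DBD::Oracle or the Perl API, and it assumes the caller supplies an
output buffer at least as large as the input. It rewrites each six-byte
surrogate pair as one four-byte UTF-8 sequence and copies everything else
through untouched (including unpaired surrogates, which a real
implementation would probably want to reject).

```c
#include <stddef.h>

/* Hypothetical helper: convert CESU-8 (what Oracle calls "UTF8") to
 * standard UTF-8. Returns the number of bytes written to out, which is
 * never more than inlen. */
size_t cesu8_to_utf8(const unsigned char *in, size_t inlen, unsigned char *out)
{
    size_t i = 0, o = 0;

    while (i < inlen) {
        /* A high surrogate (U+D800..U+DBFF) is encoded as ED A0..AF xx,
         * a low surrogate (U+DC00..U+DFFF) as ED B0..BF xx. */
        if (inlen - i >= 6 &&
            in[i]     == 0xED && (in[i + 1] & 0xF0) == 0xA0 &&
            (in[i + 2] & 0xC0) == 0x80 &&
            in[i + 3] == 0xED && (in[i + 4] & 0xF0) == 0xB0 &&
            (in[i + 5] & 0xC0) == 0x80) {
            unsigned long hi = 0xD000UL
                             | ((unsigned long)(in[i + 1] & 0x3F) << 6)
                             | (unsigned long)(in[i + 2] & 0x3F);
            unsigned long lo = 0xD000UL
                             | ((unsigned long)(in[i + 4] & 0x3F) << 6)
                             | (unsigned long)(in[i + 5] & 0x3F);
            /* Combine the pair into one supplementary code point. */
            unsigned long cp = 0x10000UL + ((hi - 0xD800UL) << 10)
                             + (lo - 0xDC00UL);

            /* Emit the four-byte UTF-8 form. */
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
            i += 6;
        } else {
            /* Anything else is already the same in both encodings. */
            out[o++] = in[i++];
        }
    }
    return o;
}
```

With this, the six bytes ED A0 BD ED B8 80 (the surrogate pair D83D DE00)
come out as F0 9F 98 80, the four-byte UTF-8 for U+1F600. Only after a
pass like that would SvUTF8_on(sv) be safe for non-BMP data.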

It just needs a note in the docs.

Tim.
