>>
>>So the key question is... can we just do SvUTF8_on(sv) on either
>>of these kinds of Oracle UTF8 encodings? Seems like the answer is
>>yes, from what Jarkko says, because they are both valid UTF8.
>>We just need to document the issue.
> 
> 
> No, Oracle's "UTF8" is very much not valid UTF-8. Valid UTF-8 cannot
> contain surrogates. If you mark a string like this as UTF-8 neither
> the Perl core nor other extension modules will be able to interpret
> it correctly.

Well, it depends what you mean by "interpret correctly"... they will
be perfectly fine _separate_ characters.  But yes, they are pretty
useless -- the UTF-8 machinery of Perl 5 gets rather upset at seeing
these surrogate code points.  No croaks, yes, as I said earlier, but
a lot of -w-noise, and also deeper gurglings from e.g. the regex engine.
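
To make the point concrete, here is a minimal sketch of a check one could
run over a byte buffer before calling SvUTF8_on(sv). The function name
has_encoded_surrogate is made up for illustration; it is not part of Perl
or of any Oracle driver. It relies only on the fact that the surrogate
code points U+D800..U+DFFF, when written out in three-byte UTF-8 style,
always appear as ED A0..BF 80..BF.

```c
#include <stddef.h>

/* Hypothetical helper, not an existing Perl or DBD API.
 * Returns 1 if buf contains a three-byte sequence that encodes a
 * surrogate code point (U+D800..U+DFFF), 0 otherwise. */
int has_encoded_surrogate(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i + 2 < len; i++) {
        if (buf[i] == 0xED                      /* lead byte for U+D000..U+DFFF */
            && (buf[i + 1] & 0xE0) == 0xA0      /* second byte in A0..BF */
            && (buf[i + 2] & 0xC0) == 0x80)     /* any continuation byte */
            return 1;
    }
    return 0;
}
```

If this returns 1, the buffer is not valid UTF-8 and would need converting
before the flag is turned on.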

> (As people have pointed out earlier in the thread,
> if you want a standard name for this weird form of encoding, that's
> "CESU-8": http://www.unicode.org/reports/tr26/.)
> 
> You'll need to do a conversion pass before you can mark it as UTF-8.

I think an Encode translation table would be the best place to do this
kind of mapping.  Encode::CESU, anyone?
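
Until something like that exists, the conversion pass itself is small.
What follows is only a sketch in plain C, not the Encode table suggested
above, and the name cesu8_to_utf8 is invented for the example. It assumes
the input is otherwise well-formed and simply copies through anything that
is not a high surrogate immediately followed by a low surrogate, so a lone
surrogate is left as it is.

```c
#include <stddef.h>

/* Hypothetical helper: rewrite CESU-8 bytes in src as UTF-8 in dst.
 * Each six-byte surrogate pair (ED A0..AF xx, ED B0..BF xx) becomes one
 * four-byte sequence; every other byte is copied unchanged.
 * dst must have room for len bytes (the output is never longer than the
 * input). Returns the number of bytes written to dst. */
size_t cesu8_to_utf8(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t i = 0, o = 0;

    while (i < len) {
        if (i + 5 < len
            && src[i]     == 0xED && (src[i + 1] & 0xF0) == 0xA0
            && (src[i + 2] & 0xC0) == 0x80
            && src[i + 3] == 0xED && (src[i + 4] & 0xF0) == 0xB0
            && (src[i + 5] & 0xC0) == 0x80) {
            /* Decode the two surrogates from their three-byte forms. */
            unsigned long hi = 0xD000UL
                | ((unsigned long)(src[i + 1] & 0x3F) << 6)
                | (unsigned long)(src[i + 2] & 0x3F);
            unsigned long lo = 0xD000UL
                | ((unsigned long)(src[i + 4] & 0x3F) << 6)
                | (unsigned long)(src[i + 5] & 0x3F);
            /* Combine them into one supplementary code point. */
            unsigned long cp = 0x10000UL
                + ((hi - 0xD800UL) << 10) + (lo - 0xDC00UL);

            /* Emit the four-byte UTF-8 form. */
            dst[o++] = (unsigned char)(0xF0 | (cp >> 18));
            dst[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            dst[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[o++] = (unsigned char)(0x80 | (cp & 0x3F));
            i += 6;
        } else {
            dst[o++] = src[i++];
        }
    }
    return o;
}
```

For example, U+1F600 arrives in CESU-8 as ED A0 BD ED B8 80 and leaves
as F0 9F 98 80. Only after a pass like this would it be reasonable to
mark the result as UTF-8.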

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen
