>>So the key question is... can we just do SvUTF8_on(sv) on either
>>of these kinds of Oracle UTF8 encodings? Seems like the answer is
>>yes, from what Jarkko says, because they are both valid UTF8.
>>We just need to document the issue.
>
> No, Oracle's "UTF8" is very much not valid UTF-8. Valid UTF-8 cannot
> contain surrogates. If you mark a string like this as UTF-8 neither
> the Perl core nor other extension modules will be able to interpret
> it correctly.

Well, it depends what you mean by "interpret correctly"... they will be
perfectly fine _separate_ characters. But yes, they are pretty useless --
the UTF-8 machinery of Perl 5 gets rather upset at seeing these surrogate
code points. No croaks, yes, as I said earlier, but a lot of -w noise,
and also deeper gurglings from e.g. the regex engine.

> (As people have pointed out earlier in the thread,
> if you want a standard name for this weird form of encoding, that's
> "CESU": http://www.unicode.org/reports/tr26/.)
>
> You'll need to do a conversion pass before you can mark it as UTF-8.

I think an Encode translation table would be the best place to do this
kind of mapping. Encode::CESU, anyone?

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'. It is 'dead'."
-- Jack Cohen
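
For illustration only, here is a rough sketch of the kind of conversion pass discussed above. It is written in Python rather than as the XS or Encode code the thread has in mind, and the names cesu8_to_utf8, _CESU_PAIR and _pair_to_utf8 are hypothetical, not part of any existing module. It assumes the input is otherwise well-formed and only rewrites each six-byte surrogate pair as the four-byte UTF-8 sequence for the same supplementary character. Unpaired surrogates are left untouched.

```python
import re

# A CESU-8 surrogate pair: a high surrogate (ED A0..AF xx) immediately
# followed by a low surrogate (ED B0..BF xx), each a 3-byte sequence.
_CESU_PAIR = re.compile(rb"\xed[\xa0-\xaf][\x80-\xbf]\xed[\xb0-\xbf][\x80-\xbf]")


def _pair_to_utf8(match):
    """Turn one six-byte surrogate pair into four-byte UTF-8."""
    b = match.group(0)
    # Decode the two 3-byte sequences back into UTF-16 surrogate values.
    high = 0xD000 | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)
    low = 0xD000 | ((b[4] & 0x3F) << 6) | (b[5] & 0x3F)
    # Combine the surrogates into the supplementary code point.
    code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return chr(code_point).encode("utf-8")


def cesu8_to_utf8(data: bytes) -> bytes:
    """Rewrite CESU-8 surrogate pairs as UTF-8; other bytes pass through."""
    return _CESU_PAIR.sub(_pair_to_utf8, data)
```

As an example, U+10400 appears in CESU-8 as the bytes ED A0 81 ED B0 80, which a strict UTF-8 decoder rejects. The sketch turns them into F0 90 90 80, and only after that step would it be safe to flag the string as UTF-8.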