Re: AL32UTF8
>> So the key question is... can we just do SvUTF8_on(sv) on either of
>> these kinds of Oracle UTF8 encodings? Seems like the answer is yes,
>> from what Jarkko says, because they are both valid UTF8. We just need
>> to document the issue.
>
> No, Oracle's UTF8 is very much not valid UTF-8. Valid UTF-8 cannot
> contain surrogates. If you mark a string like this as UTF-8, neither
> the Perl core nor other extension modules will be able to interpret
> it correctly.

Well, it depends what you mean by "interpret correctly"... they will be
perfectly fine as _separate_ characters. But yes, they are pretty
useless -- the UTF-8 machinery of Perl 5 gets rather upset at seeing
these surrogate code points. No croaks, yes, as I said earlier, but a
lot of -w noise, and also deeper gurglings from e.g. the regex engine.

> (As people have pointed out earlier in the thread, if you want a
> standard name for this weird form of encoding, that's CESU:
> http://www.unicode.org/reports/tr26/.)
>
> You'll need to do a conversion pass before you can mark it as UTF-8.

I think an Encode translation table would be the best place to do this
kind of mapping. Encode::CESU, anyone?

-- 
Jarkko Hietaniemi [EMAIL PROTECTED] http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'. It is
'dead'." -- Jack Cohen
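[For reference, a small sketch of what that conversion pass has to undo.
This is an illustration in Python, not anything from DBD::Oracle, and
U+10400 is just an arbitrary supplementary character: CESU-style
encoding runs the UTF-16 surrogate pair through the UTF-8 byte format,
giving two 3-byte sequences instead of one 4-byte sequence.]

```python
# Illustration only: how CESU-8 (Oracle's "UTF8") differs from real
# UTF-8 for a supplementary character such as U+10400.
ch = "\U00010400"
cp = ord(ch)

# Real UTF-8 / AL32UTF8: one 4-byte sequence for the single code point.
utf8 = ch.encode("utf-8")
assert utf8 == b"\xf0\x90\x90\x80"

# CESU-8: split into a UTF-16 surrogate pair, then encode each
# surrogate as a 3-byte UTF-8-style sequence -- six bytes total.
hi = 0xD800 + ((cp - 0x10000) >> 10)    # high surrogate, U+D801
lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)  # low surrogate, U+DC00
cesu8 = (chr(hi) + chr(lo)).encode("utf-8", "surrogatepass")
assert cesu8 == b"\xed\xa0\x81\xed\xb0\x80" and len(cesu8) == 6

# The conversion pass: decode the surrogates, then merge the pair
# (here via a UTF-16 round trip) to recover the real character.
pair = cesu8.decode("utf-8", "surrogatepass")
merged = pair.encode("utf-16-be", "surrogatepass").decode("utf-16-be")
assert merged == ch
```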
Re: AL32UTF8
> On Sat, 2004-05-01 at 00:37, Lincoln A. Baxter wrote:
>> On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:
>>> On Thu, Apr 29, 2004 at 10:42:18PM -0400, Lincoln A. Baxter wrote:
>>>> On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
>>>>> Am I right in thinking that perl's internal utf8 representation
>>>>> represents surrogates as a single (4 byte) code point and not as
>>>>> two separate code points? This is the form that Oracle call
>>>>> AL32UTF8.
>>>
>>> [snip]
>>>
>>> Were you using characters that require surrogates in UTF16? If not
>>> then you wouldn't see a difference between .UTF8 and .AL32UTF8.
>>
>> Hmmm... err... probably not... I guess I need to hunt one up.
>>
>> There is only one case in which 3 and 4 byte characters can be round
>> tripped. After a bunch of other changes and fixups, I tested with
>> the following two new totally invented (by me) super wide characters:
>>
>>   row 8: nice_string=\x{32263A}   byte_string=248|140|162|152|186   (3 byte wide char)
>>   row 9: nice_string=\x{2532263A} byte_string=252|165|140|162|152|186 (4 byte wide char)
>>
>> In a database with ncharset=AL16UTF16 (NLS_NCHAR=UTF8 or AL32UTF8),
>> storage is as follows:
>>
>>   row 8: nch=Typ=1 Len=10: 255,253,255,253,255,253,255,253,255,253
>>   row 9: nch=Typ=1 Len=12: 255,253,255,253,255,253,255,253,255,253,255,253
>>
>> Values can NOT be round tripped.
>>
>> In a database with ncharset=UTF8 (NLS_NCHAR=AL32UTF8), storage is as
>> follows:
>>
>>   row 8: nch=Typ=1 Len=15: 239,191,189,239,191,189,239,191,189,239,191,189,239,191,189
>>   row 9: nch=Typ=1 Len=18: 239,191,189,239,191,189,239,191,189,239,191,189,239,191,189,239,191,189
>>
>> Values can NOT be round tripped.
>>
>> In a database with ncharset=UTF8 and NLS_NCHAR=UTF8, storage is as
>> follows:
>>
>>   row 8: nch=Typ=1 Len=5: 248,140,162,152,186
>>   row 9: nch=Typ=1 Len=6: 252,165,140,162,152,186
>>
>> Values CAN be round tripped!
>>
>> So, it would appear that UTF8 is the PREFERRED database ncharset,
>> not AL16UTF16, and that NLS_NCHAR=UTF8 is more portable than
>> NLS_NCHAR=AL32UTF8.
>>
>> [snip]
>
> Seems reasonable. I think you made a good point about the cost of
> crawling through the data. I'm convinced.
> If you have not already changed it, I will.
>
> p.s. If we do opt for defaulting NLS_NCHAR (effectively) if NLS_LANG
> and NLS_NCHAR are not defined then we should use AL32UTF8 if possible.

I changed that last night (to use AL32UTF8). But given the above
results... perhaps I should change it back.

Lincoln
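[A side note on the stored bytes above, sketched in Python rather than
Perl as an illustration, and based on my reading of the dumps: 255,253
and 239,191,189 are the Unicode replacement character U+FFFD in
big-endian UTF-16 and in UTF-8 respectively, i.e. the failed round
trips replaced the characters outright. That is expected here, since
the invented code points lie above U+10FFFF and so cannot be expressed
in UTF-16 (or in standard AL32UTF8) at all.]

```python
# Interpreting the stored bytes from the failed round trips above.
# "255,253" repeated -> U+FFFD (replacement char) in big-endian UTF-16.
al16 = bytes([255, 253]) * 5
assert al16.decode("utf-16-be") == "\ufffd" * 5

# "239,191,189" repeated -> U+FFFD in UTF-8.
u8 = bytes([239, 191, 189]) * 5
assert u8.decode("utf-8") == "\ufffd" * 5

# The invented characters are beyond U+10FFFF, the ceiling of UTF-16
# and of standard UTF-8/AL32UTF8, so those conversions must lose them.
assert 0x32263A > 0x10FFFF and 0x2532263A > 0x10FFFF
```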
Re: AL32UTF8
On Sat, May 01, 2004 at 05:35:58PM -0400, Lincoln A. Baxter wrote:
> Hello Owen,
>
> On Sat, 2004-05-01 at 16:46, Owen Taylor wrote:
>> On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:
>>> You can use UTF8 and AL32UTF8 by setting NLS_LANG for OCI client
>>> applications. If you do not need supplementary characters, then it
>>> does not matter whether you choose UTF8 or AL32UTF8. However, if
>>> your OCI applications might handle supplementary characters, then
>>> you need to make a decision. Because UTF8 can require up to three
>>> bytes for each character, one supplementary character is
>>> represented in two code points, totalling six bytes. In AL32UTF8,
>>> one supplementary character is represented in one code point,
>>> totalling four bytes.
>>>
>>> So the key question is... can we just do SvUTF8_on(sv) on either
>>> of these kinds of Oracle UTF8 encodings? Seems like the answer is
>>> yes, from what Jarkko says, because they are both valid UTF8. We
>>> just need to document the issue.
>>
>> No, Oracle's UTF8 is very much not valid UTF-8. Valid UTF-8 cannot
>> contain surrogates. If you mark a string like this as UTF-8,
>> neither the Perl core nor other extension modules will be able to
>> interpret it correctly.
>>
>> (As people have pointed out earlier in the thread, if you want a
>> standard name for this weird form of encoding, that's CESU:
>> http://www.unicode.org/reports/tr26/.)
>>
>> You'll need to do a conversion pass before you can mark it as UTF-8.
>
> Your message comes at a PERFECT time! I just spent about 3 hours
> coming to that same conclusion empirically: I made the changes to do
> what Tim had asked (just mark the string as UTF8), and it breaks a
> bunch of stuff, like the 8bit nchar test, and the long test when the
> column type is LONG.
>
> I think I am going to back out (or rather... NOT COMMIT) those
> changes, leaving the code that inspects the fetched string to see if
> it (looks like) utf8 before setting the flag.

I think we should always mark Oracle UTF8 strings as Perl UTF8.

Basically Oracle UTF8 is broken for non-BMP characters. Period.
So no one should be using the Oracle UTF8 character set for them. It
just needs a note in the docs.

Tim.
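[The "inspect whether it looks like utf8" check Lincoln mentions could
be sketched like this. Python illustration, not the actual DBD::Oracle
code: strict UTF-8 validation rejects exactly the CESU-style surrogate
sequences that make setting the flag unsafe.]

```python
# Sketch of a "looks like UTF-8" check: only treat a fetched byte
# string as UTF-8 if it decodes strictly, which in particular rejects
# the surrogate sequences (0xED 0xA0..0xBF ...) that Oracle's UTF8
# charset emits for non-BMP characters.

def looks_like_utf8(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")  # strict decode: surrogates are invalid
        return True
    except UnicodeDecodeError:
        return False

# Proper UTF-8 for U+10400 passes; its CESU-8 surrogate-pair form fails.
assert looks_like_utf8(b"\xf0\x90\x90\x80")
assert not looks_like_utf8(b"\xed\xa0\x81\xed\xb0\x80")
```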