Re: AL32UTF8
> > So the key question is... can we just do SvUTF8_on(sv) on either of
> > these kinds of Oracle UTF8 encodings? Seems like the answer is yes,
> > from what Jarkko says, because they are both valid UTF8. We just need
> > to document the issue.
>
> No, Oracle's UTF8 is very much not valid UTF-8. Valid UTF-8 cannot
> contain surrogates. If you mark a string like this as UTF-8 neither the
> Perl core nor other extension modules will be able to interpret it
> correctly.

Well, it depends what you mean by interpret correctly... they will be
perfectly fine _separate_ characters. But yes, they are pretty useless --
the UTF-8 machinery of Perl 5 gets rather upset at seeing these surrogate
code points. No croaks, yes, as I said earlier, but a lot of -w noise, and
also deeper gurglings from e.g. the regex engine.

> (As people have pointed out earlier in the thread, if you want a
> standard name for this weird form of encoding, that's CESU:
> http://www.unicode.org/reports/tr26/.) You'll need to do a conversion
> pass before you can mark it as UTF-8.

I think an Encode translation table would be the best place to do this
kind of mapping. Encode::CESU, anyone?

-- 
Jarkko Hietaniemi [EMAIL PROTECTED] http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'. It is 'dead'."
-- Jack Cohen
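[Editorial note: the conversion pass discussed above can be sketched outside Perl, since the byte-level mechanics are language-independent. Below is a minimal illustration in Python; the `cesu8_to_utf8` helper is a hypothetical name invented here, not part of any existing Encode module.]

```python
# Why Oracle's "UTF8" bytes are not valid UTF-8: a supplementary character
# is stored as two 3-byte-encoded UTF-16 surrogates (CESU-8), so a
# conversion pass is needed before the data can be marked as real UTF-8.

def cesu8_to_utf8(data: bytes) -> bytes:
    """Decode CESU-8 bytes and re-encode them as standard UTF-8."""
    # 'surrogatepass' lets us decode the 3-byte surrogate sequences that
    # a strict UTF-8 decode would reject.
    lone = data.decode("utf-8", errors="surrogatepass")
    # Round-trip through UTF-16 to re-pair the surrogates into the real
    # supplementary-plane code point, then emit standard UTF-8.
    return lone.encode("utf-16", "surrogatepass").decode("utf-16").encode("utf-8")

char = "\U00010400"                 # a supplementary-plane character
utf8 = char.encode("utf-8")         # 4 bytes: f0 90 90 80

# Build the CESU-8 form by encoding each UTF-16 surrogate separately.
hi, lo = 0xD801, 0xDC00             # surrogate pair for U+10400
cesu8 = (chr(hi).encode("utf-8", "surrogatepass")
         + chr(lo).encode("utf-8", "surrogatepass"))   # 6 bytes

print(len(utf8), len(cesu8))          # 4 6
print(cesu8_to_utf8(cesu8) == utf8)   # True
```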
Re: AL32UTF8
> On Sat, 2004-05-01 at 00:37, Lincoln A. Baxter wrote:
> > On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:
> > > On Thu, Apr 29, 2004 at 10:42:18PM -0400, Lincoln A. Baxter wrote:
> > > > On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
> > > > > Am I right in thinking that perl's internal utf8 representation
> > > > > represents surrogates as a single (4 byte) code point and not as
> > > > > two separate code points? This is the form that Oracle call
> > > > > AL32UTF8.
> > > [snip]
> > > Were you using characters that require surrogates in UTF16? If not
> > > then you wouldn't see a difference between .UTF8 and .AL32UTF8.
> >
> > Hmmm...err.. probably not... I guess I need to hunt one up.
> >
> > There is only one case in which 3 and 4 byte characters can be round
> > tripped. After a bunch of other changes and fixups, I tested with the
> > following two new totally invented (by me) super wide characters:
> >
> >   row: 8: nice_string=\x{32263A}   byte_string=248|140|162|152|186 (3 byte wide char)
> >   row: 9: nice_string=\x{2532263A} byte_string=252|165|140|162|152|186 (4 byte wide char)
> >
> > In a database with ncharset=al16utf16, storage is as follows
> > (NLS_NCHAR=UTF8 or AL32UTF8):
> >
> >   row 8: nch=Typ=1 Len=10: 255,253,255,253,255,253,255,253,255,253
> >   row 9: nch=Typ=1 Len=12: 255,253,255,253,255,253,255,253,255,253,255,253
> >
> > Values can NOT be round tripped.
> >
> > In a database with ncharset=utf8, storage is as follows
> > (NLS_NCHAR=AL32UTF8):
> >
> >   row 8: nch=Typ=1 Len=15: 239,191,189,239,191,189,239,191,189,239,191,189,239,191,189
> >   row 9: nch=Typ=1 Len=18: 239,191,189,239,191,189,239,191,189,239,191,189,239,191,189,239,191
> >
> > Values can NOT be round tripped.
> >
> > In a database with ncharset=utf8 and NLS_NCHAR=UTF8, storage is as
> > follows:
> >
> >   row 8: nch=Typ=1 Len=5: 248,140,162,152,186
> >   row 9: nch=Typ=1 Len=6: 252,165,140,162,152,186
> >
> > Values CAN be round tripped!
> >
> > So, it would appear that UTF8 is the PREFERRED database NCHARSET, not
> > AL16UTF16, and that NLS_NCHAR=UTF8 is more portable than
> > NLS_NCHAR=AL32UTF8.
> [snip]
>
> Seems reasonable. I think you made a good point about the cost of
> crawling through the data. I'm convinced. If you have not already
> changed it, I will.
>
> p.s. If we do opt for defaulting NLS_NCHAR (effectively) if NLS_LANG
> and NLS_NCHAR are not defined then we should use AL32UTF8 if possible.

I changed that last night (to use AL32UTF8). But given the above
results... perhaps I should change it back.

Lincoln
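[Editorial note: the byte dumps in the failed round trips above are not random garbage. The repeating 255,253 pairs and 239,191,189 triples are the UTF-16 and UTF-8 encodings of U+FFFD REPLACEMENT CHARACTER, i.e. the character-set conversion substituted U+FFFD for each code unit it could not represent. A quick check, in Python purely for illustration (nothing Oracle-specific is assumed):]

```python
# Both failure patterns decode to runs of U+FFFD REPLACEMENT CHARACTER,
# the stand-in a character-set converter emits for unrepresentable input.

utf16_row = bytes([255, 253] * 5)        # row 8 in the al16utf16 database
utf8_row  = bytes([239, 191, 189] * 5)   # row 8 in the utf8 database

print(utf16_row.decode("utf-16-be") == "\ufffd" * 5)   # True
print(utf8_row.decode("utf-8") == "\ufffd" * 5)        # True
```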
Re: AL32UTF8
On Sat, May 01, 2004 at 05:35:58PM -0400, Lincoln A. Baxter wrote:
> Hello Owen,
>
> On Sat, 2004-05-01 at 16:46, Owen Taylor wrote:
> > On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:
> > > You can use UTF8 and AL32UTF8 by setting NLS_LANG for OCI client
> > > applications. If you do not need supplementary characters, then it
> > > does not matter whether you choose UTF8 or AL32UTF8. However, if
> > > your OCI applications might handle supplementary characters, then
> > > you need to make a decision. Because UTF8 can require up to three
> > > bytes for each character, one supplementary character is
> > > represented in two code points, totalling six bytes. In AL32UTF8,
> > > one supplementary character is represented in one code point,
> > > totalling four bytes.
> > >
> > > So the key question is... can we just do SvUTF8_on(sv) on either
> > > of these kinds of Oracle UTF8 encodings? Seems like the answer is
> > > yes, from what Jarkko says, because they are both valid UTF8. We
> > > just need to document the issue.
> >
> > No, Oracle's UTF8 is very much not valid UTF-8. Valid UTF-8 cannot
> > contain surrogates. If you mark a string like this as UTF-8 neither
> > the Perl core nor other extension modules will be able to interpret
> > it correctly.
> >
> > (As people have pointed out earlier in the thread, if you want a
> > standard name for this weird form of encoding, that's CESU:
> > http://www.unicode.org/reports/tr26/.) You'll need to do a
> > conversion pass before you can mark it as UTF-8.
>
> Your message comes at a PERFECT time! I just spent about 3 hours
> coming to that same conclusion empirically: I made the changes to do
> what Tim had asked (just mark the string as UTF8), and it breaks a
> bunch of stuff, like the 8bit nchar test, and the long test when
> column type is LONG. I think I am going to back out (or rather... NOT
> COMMIT) those changes, leaving the code that inspects the fetched
> string to see if it (looks like) utf8 before setting the flag.

I think we should always mark Oracle UTF8 strings as Perl UTF8.
Basically Oracle UTF8 is broken for non-BMP characters. Period.
So no one should be using the Oracle UTF8 character set for them.
It just needs a note in the docs.

Tim.
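[Editorial note: the "inspects the fetched string to see if it (looks like) utf8" approach mentioned above can be approximated as follows. This is a Python sketch of the idea only, not the actual DBD::Oracle code: a strict UTF-8 validity check inherently rejects the 3-byte surrogate sequences that Oracle's UTF8/CESU-8 data can contain, so such strings would not get the flag.]

```python
def looks_like_strict_utf8(data: bytes) -> bool:
    """True only for well-formed UTF-8, which excludes surrogate code points."""
    try:
        data.decode("utf-8")   # strict decoding rejects ED A0..BF sequences
        return True
    except UnicodeDecodeError:
        return False

bmp   = "caf\u00e9".encode("utf-8")    # ordinary UTF-8, BMP only
cesu8 = b"\xed\xa0\x81\xed\xb0\x80"    # CESU-8 form of U+10400

print(looks_like_strict_utf8(bmp))     # True
print(looks_like_strict_utf8(cesu8))   # False
```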
Re: AL32UTF8
Tim Bunce wrote:
> On Fri, Apr 30, 2004 at 10:58:19PM +0700, Martin Hosken wrote:
> > IIRC AL32UTF8 was introduced at the behest of Oracle (a voting
> > member of Unicode) because they were storing higher plane codes
> > using the surrogate pair technique of UTF-16 mapped into UTF-8
> > (i.e. resulting in 2 UTF-8 chars or 6 bytes) rather than the correct
> > UTF-8 way of a single char of 4+ bytes. There is no real trouble
> > doing it that way since anyone can convert between the 'wrong' UTF-8
> > and the correct form. But they found that if you do a simple binary
> > based sort of a string in AL32UTF8 and compare it to a sort in true
> > UTF-8 you end up with a subtly different order. On this basis they
> > made a request to the UTC to have AL32UTF8 added as a kludge and out
> > of the kindness of their hearts the UTC agreed, thus saving Oracle
> > from a whole heap of work. But all are agreed that UTF-8 and not
> > AL32UTF8 is the way forward.
>
> Um, now you've confused me. The Oracle docs say
>
>   In AL32UTF8, one supplementary character is represented in one code
>   point, totalling four bytes.
>
> which you say is the correct UTF-8 way. So the old Oracle ``UTF8''
> charset is what's now called CESU-8, and what Oracle call ``AL32UTF8''
> is the correct UTF-8 way. So did you mean CESU-8 when you said
> AL32UTF8? I guess so.

Thank you for reminding me of this. I used to know that, but forgot it
and was about to write my colleague to use 'UTF8' (instead of
'AL32UTF8') when she creates a database with Oracle for our project.

Oracle is notorious for using 'incorrect' and confusing character
encoding names. Their 'AL32UTF8' is the true and only UTF-8, while
__their__ 'UTF8' is CESU-8 (a beast that MUST be confined within Oracle
and MUST NOT be leaked out to the world at large. Needless to say, it'd
be even better had it not been born.)

Oracle has no excuse whatsoever for failing to get their 'UTF8' right
in the first place, because Unicode had been extended beyond the BMP a
long time before they introduced UTF8 into their product(s) (let alone
the fact that ISO 10646 had non-BMP planes from the very beginning in
the 1980s, and that UTF-8 was devised to cover the full set of ISO
10646). However, they failed, and in their 'UTF8' a single character
beyond the BMP was (and still is) encoded as a pair of 3-byte
representations of surrogate code points. Apparently for the sake of
backward compatibility (I wonder how many instances of Oracle databases
existed with non-BMP characters stored in their 'UTF8' when they
decided to follow this route), they decided to keep the designation
'UTF8' for CESU-8 and came up with a new designation 'AL32UTF8' for the
true and only UTF-8.

Jungshik
Re: AL32UTF8
[The background to this is that Lincoln and I have been working on
Unicode support for DBD::Oracle. (Actually Lincoln's done most of the
heavy lifting; I've mostly been setting goals and directions at the DBI
API level and scratching at edge cases. Like this one.)]

On Thu, Apr 29, 2004 at 09:23:45PM +0300, Jarkko Hietaniemi wrote:
> Tim Bunce wrote:
> > Am I right in thinking that perl's internal utf8 representation
> > represents surrogates as a single (4 byte) code point and not as two
> > separate code points?
>
> Mmmh. Right and wrong... as a single code point, yes, since the real
> UTF-8 doesn't do surrogates, which are only a UTF-16 thing. 4 bytes,
> no: 3 bytes.
>
> > This is the form that Oracle call AL32UTF8.
>
> Does this http://www.unicode.org/reports/tr26/ look like Oracle's
> older (?) UTF8?
>
>   CESU-8 defines an encoding scheme for Unicode identical to UTF-8
>   except for its representation of supplementary characters. In
>   CESU-8, supplementary characters are represented as six-byte
>   sequences resulting from the transformation of each UTF-16 surrogate
>   code unit into an eight-bit form similar to the UTF-8
>   transformation, but without first converting the input surrogate
>   pairs to a scalar value.

Yes, that sounds like it. But see my quote from Oracle docs in my reply
to Lincoln's email to make sure. (I presume it dates from before UTF16
had surrogate pairs. When they were added to UTF16 they gave the name
CESU-8 to what old UTF16-to-UTF8 conversion code would produce when
given surrogate pairs. A classic standards maneuver :)

> > What would be the effect of setting SvUTF8_on(sv) on a valid utf8
> > byte string that used surrogates? Would there be problems?
>
> You would get out the surrogate code points from the sv, not the
> supplementary plane code point the surrogate pairs are encoding.
> Depends what you do with the data: this might be okay, might not.
> Since it's valid UTF-8, nothing should croak perl-side.

Okay. Thanks.

Basically I need to document that Oracle AL32UTF8 should be used as the
client charset in preference to the older UTF8, because UTF8 doesn't do
the "best"? thing with surrogate pairs. Seems like "best" is the, er,
best word to use here, as "right" would be too strong. But then the
shortest form requirement is quite strong, so perhaps "modern standard"
would be the right words.

Tim.
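[Editorial note: Jarkko's "you would get out the surrogate code points" above can be made concrete. If the 6-byte CESU-8 sequence is merely flagged as character data rather than converted, the string holds two surrogate code points instead of one character. Python's `surrogatepass` error handler stands in here for Perl's lax internal utf8; this is an illustration, not the DBD::Oracle code path.]

```python
# What "just setting the UTF-8 flag" on CESU-8 bytes yields: the two
# surrogate code points, not the single supplementary-plane character.
cesu8 = b"\xed\xa0\x81\xed\xb0\x80"          # CESU-8 form of U+10400

flagged = cesu8.decode("utf-8", "surrogatepass")   # no conversion pass
print(len(flagged))                          # 2 code points, not 1
print([hex(ord(c)) for c in flagged])        # ['0xd801', '0xdc00']

# Jarkko's "4 bytes, no: 3 bytes": a lone surrogate encodes in 3 bytes.
print(len(chr(0xD801).encode("utf-8", "surrogatepass")))   # 3
```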
Re: AL32UTF8
Dear Tim,

> >   CESU-8 defines an encoding scheme for Unicode identical to UTF-8
> >   except for its representation of supplementary characters. In
> >   CESU-8, supplementary characters are represented as six-byte
> >   sequences resulting from the transformation of each UTF-16
> >   surrogate code unit into an eight-bit form similar to the UTF-8
> >   transformation, but without first converting the input surrogate
> >   pairs to a scalar value.
>
> Yes, that sounds like it. But see my quote from Oracle docs in my
> reply to Lincoln's email to make sure. (I presume it dates from before
> UTF16 had surrogate pairs. When they were added to UTF16 they gave the
> name CESU-8 to what old UTF16-to-UTF8 conversion code would produce
> when given surrogate pairs. A classic standards maneuver :)

IIRC AL32UTF8 was introduced at the behest of Oracle (a voting member of
Unicode) because they were storing higher plane codes using the
surrogate pair technique of UTF-16 mapped into UTF-8 (i.e. resulting in
2 UTF-8 chars or 6 bytes) rather than the correct UTF-8 way of a single
char of 4+ bytes. There is no real trouble doing it that way, since
anyone can convert between the 'wrong' UTF-8 and the correct form. But
they found that if you do a simple binary based sort of a string in
AL32UTF8 and compare it to a sort in true UTF-8 you end up with a subtly
different order. On this basis they made a request to the UTC to have
AL32UTF8 added as a kludge, and out of the kindness of their hearts the
UTC agreed, thus saving Oracle from a whole heap of work. But all are
agreed that UTF-8 and not AL32UTF8 is the way forward.

Yours,
Martin
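[Editorial note: the sort-order difference Martin describes is easy to reproduce. Binary comparison of real UTF-8 preserves code-point order, while the surrogate-pair form does not, because the surrogate lead byte 0xED sorts below the 0xEF..0xF4 lead bytes of high BMP and supplementary characters. A small check, in Python for illustration:]

```python
# Binary sort order: real UTF-8 preserves code-point order; the
# surrogate-pair (CESU-8 style) encoding does not.
last_bmp   = "\uffff".encode("utf-8")        # ef bf bf
supp_utf8  = "\U00010000".encode("utf-8")    # f0 90 80 80
supp_cesu8 = (chr(0xD800).encode("utf-8", "surrogatepass")
              + chr(0xDC00).encode("utf-8", "surrogatepass"))  # ed a0 80 ed b0 80

# U+FFFF < U+10000 in code-point order:
print(last_bmp < supp_utf8)    # True  (UTF-8 binary order agrees)
print(last_bmp < supp_cesu8)   # False (surrogate form sorts the other way)
```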
Re: AL32UTF8
On Fri, Apr 30, 2004 at 03:49:13PM +0300, Jarkko Hietaniemi wrote:
> > Okay. Thanks.
> >
> > Basically I need to document that Oracle AL32UTF8 should be used as
> > the client charset in preference to the older UTF8 because UTF8
> > doesn't do the best? thing with surrogate pairs.
>
> ... because what Oracle calls UTF8 is not conformant with the modern
> definition of UTF8.

Thanks Jarkko.

Tim.

> > Seems like best is the, er, best word to use here as right would be
> > too strong. But then the shortest form requirement is quite strong
> > so perhaps modern standard would be the right words.
> >
> > Tim.
Re: AL32UTF8
Tim Bunce wrote:
> Am I right in thinking that perl's internal utf8 representation
> represents surrogates as a single (4 byte) code point and not as two
> separate code points?

Mmmh. Right and wrong... as a single code point, yes, since the real
UTF-8 doesn't do surrogates, which are only a UTF-16 thing. 4 bytes,
no: 3 bytes.

> This is the form that Oracle call AL32UTF8.

Does this http://www.unicode.org/reports/tr26/ look like Oracle's
older (?) UTF8?

> What would be the effect of setting SvUTF8_on(sv) on a valid utf8
> byte string that used surrogates? Would there be problems?

You would get out the surrogate code points from the sv, not the
supplementary plane code point the surrogate pairs are encoding.
Depends what you do with the data: this might be okay, might not.
Since it's valid UTF-8, nothing should croak perl-side.

> (For example, a string returned from Oracle when using the UTF8
> character set instead of the newer AL32UTF8 one.)
>
> Tim.