Re: AL32UTF8
[The background to this is that Lincoln and I have been working on
Unicode support for DBD::Oracle. (Actually Lincoln's done most of the
heavy lifting; I've mostly been setting goals and directions at the DBI
API level and scratching at edge cases. Like this one.)]

On Thu, Apr 29, 2004 at 09:23:45PM +0300, Jarkko Hietaniemi wrote:
> Tim Bunce wrote:
> > Am I right in thinking that perl's internal utf8 representation
> > represents surrogates as a single (4 byte) code point and not as two
> > separate code points?
>
> Mmmh. Right and wrong... As a single code point, yes, since real UTF-8
> doesn't do surrogates, which are only a UTF-16 thing. 4 bytes, no:
> 3 bytes.
>
> > This is the form that Oracle calls AL32UTF8.
>
> Does this http://www.unicode.org/reports/tr26/ look like Oracle's
> older (?) UTF8?
>
>   CESU-8 defines an encoding scheme for Unicode identical to UTF-8
>   except for its representation of supplementary characters. In CESU-8,
>   supplementary characters are represented as six-byte sequences
>   resulting from the transformation of each UTF-16 surrogate code unit
>   into an eight-bit form similar to the UTF-8 transformation, but
>   without first converting the input surrogate pairs to a scalar value.

Yes, that sounds like it. But see my quote from the Oracle docs in my
reply to Lincoln's email to make sure.

(I presume it dates from before UTF-16 had surrogate pairs. When they
were added to UTF-16, the name CESU-8 was given to what old UTF-16 to
UTF-8 conversion code would produce when given surrogate pairs. A
classic standards maneuver :)

> > What would be the effect of setting SvUTF8_on(sv) on a valid utf8
> > byte string that used surrogates? Would there be problems?
>
> You would get out the surrogate code points from the sv, not the
> supplementary plane code point the surrogate pairs are encoding.
> Depends what you do with the data: this might be okay, might not.
> Since it's valid UTF-8, nothing should croak perl-side.

Okay. Thanks.
Basically I need to document that Oracle AL32UTF8 should be used as the
client charset in preference to the older UTF8, because UTF8 doesn't do
the "best"? thing with surrogate pairs. Seems like "best" is the, er,
best word to use here, as "right" would be too strong. But then the
shortest-form requirement is quite strong, so perhaps "the modern
standard" would be the right words.

Tim.
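[Editorial aside: the distinction the thread is circling can be made
concrete. Below is a minimal sketch in Python (chosen for a
self-contained, runnable illustration; the `cesu8_encode` helper is
hypothetical, not part of any library) showing how CESU-8, which is what
Oracle's older "UTF8" charset produces, differs from true UTF-8
(Oracle's AL32UTF8) for a supplementary-plane character.]

```python
# Sketch: CESU-8 (Oracle's older "UTF8") vs true UTF-8 (AL32UTF8)
# for a supplementary-plane character. cesu8_encode is illustrative.

def cesu8_encode(s: str) -> bytes:
    """Encode like old UTF-16-to-UTF-8 converters: a supplementary
    character becomes a UTF-16 surrogate pair, and each surrogate code
    unit is then encoded as a 3-byte UTF-8-style sequence (6 bytes)."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8")        # BMP: identical to UTF-8
        else:
            cp -= 0x10000                    # split into a surrogate pair
            for unit in (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)):
                out += bytes([0xE0 | (unit >> 12),
                              0x80 | ((unit >> 6) & 0x3F),
                              0x80 | (unit & 0x3F)])
    return bytes(out)

deseret = "\U00010400"                       # DESERET CAPITAL LETTER LONG I
print(deseret.encode("utf-8").hex())         # f0909080     (4 bytes)
print(cesu8_encode(deseret).hex())           # eda081edb080 (6 bytes)
```

Note how the BMP range is byte-identical in both encodings, which is why
the difference only surfaces once data containing surrogate pairs shows
up.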
Re: AL32UTF8
Dear Tim,

> >   CESU-8 defines an encoding scheme for Unicode identical to UTF-8
> >   except for its representation of supplementary characters. In
> >   CESU-8, supplementary characters are represented as six-byte
> >   sequences resulting from the transformation of each UTF-16
> >   surrogate code unit into an eight-bit form similar to the UTF-8
> >   transformation, but without first converting the input surrogate
> >   pairs to a scalar value.
>
> Yes, that sounds like it. But see my quote from the Oracle docs in my
> reply to Lincoln's email to make sure.
>
> (I presume it dates from before UTF-16 had surrogate pairs. When they
> were added to UTF-16, the name CESU-8 was given to what old UTF-16 to
> UTF-8 conversion code would produce when given surrogate pairs. A
> classic standards maneuver :)

IIRC, CESU-8 was introduced at the behest of Oracle (a voting member of
Unicode) because they were storing higher-plane codes using the
surrogate-pair technique of UTF-16 mapped into UTF-8 (i.e. resulting in
two 3-byte sequences, or 6 bytes) rather than the correct UTF-8 way of a
single 4-byte character. There is no real trouble doing it that way,
since anyone can convert between the 'wrong' UTF-8 and the correct form.
But they found that if you do a simple binary-based sort of a string in
that encoding and compare it to a sort in true UTF-8, you end up with a
subtly different order. On this basis they made a request to the UTC to
have CESU-8 added as a kludge, and out of the kindness of their hearts
the UTC agreed, thus saving Oracle a whole heap of work. But all are
agreed that UTF-8, and not CESU-8, is the way forward.

Yours,

Martin
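[Editorial aside: the sort-order divergence Martin describes can be
checked directly. A supplementary character's encoding leads with 0xF0
in true UTF-8 but with 0xED (a surrogate's 3-byte form) in CESU-8, so
its position relative to high BMP characters, whose encodings lead with
0xEE or 0xEF, flips under a binary sort. A self-contained Python sketch,
with the standard byte sequences for U+FFFD and U+10400 written out by
hand:]

```python
# Binary sort order of the same two characters differs between
# true UTF-8 (Oracle AL32UTF8) and CESU-8 (Oracle's older "UTF8").

utf8 = {
    "\ufffd":     b"\xef\xbf\xbd",              # U+FFFD: 3 bytes either way
    "\U00010400": b"\xf0\x90\x90\x80",          # true UTF-8: leads with 0xF0
}
cesu8 = {
    "\ufffd":     b"\xef\xbf\xbd",
    "\U00010400": b"\xed\xa0\x81\xed\xb0\x80",  # surrogate pair: leads with 0xED
}

chars = ["\U00010400", "\ufffd"]
print([hex(ord(c)) for c in sorted(chars, key=utf8.get)])
# ['0xfffd', '0x10400']  -- code point order, as UTF-8 guarantees
print([hex(ord(c)) for c in sorted(chars, key=cesu8.get)])
# ['0x10400', '0xfffd']  -- the supplementary character jumps ahead
```

True UTF-8 was designed so that a plain byte-wise sort matches code
point order; CESU-8 gives up that property, which is exactly why binary
sorts of the two encodings disagree.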
Re: AL32UTF8
On Fri, Apr 30, 2004 at 03:49:13PM +0300, Jarkko Hietaniemi wrote:
> > Okay. Thanks.
> >
> > Basically I need to document that Oracle AL32UTF8 should be used as
> > the client charset in preference to the older UTF8 because UTF8
> > doesn't do the "best"? thing with surrogate pairs.
>
> "because what Oracle calls UTF8 is not conformant with the modern
> definition of UTF8"

Thanks Jarkko.

> > Seems like "best" is the, er, best word to use here, as "right" would
> > be too strong. But then the shortest-form requirement is quite
> > strong, so perhaps "the modern standard" would be the right words.

Tim.

> --
> Jarkko Hietaniemi [EMAIL PROTECTED] http://www.iki.fi/jhi/
> "There is this special biologist word we use for 'stable'.
>  It is 'dead'." -- Jack Cohen