On Fri, Apr 30, 2004 at 10:58:19PM +0700, Martin Hosken wrote:
IIRC AL32UTF8 was introduced at the behest of Oracle (a voting member of Unicode) because they were storing higher plane codes using the surrogate pair technique of UTF-16 mapped into UTF-8 (i.e. resulting in 2 UTF-8 chars or 6 bytes) rather than the correct UTF-8 way of a single char of 4+ bytes. There is no real trouble doing it that way, since anyone can convert between the 'wrong' UTF-8 and the correct form. But they found that if you do a simple binary sort of strings in AL32UTF8 and compare it to a sort in true UTF-8, you end up with a subtly different order. On this basis they made a request to the UTC to have AL32UTF8 added as a kludge, and out of the kindness of their hearts the UTC agreed, thus saving Oracle from a whole heap of work. But all are agreed that UTF-8 and not AL32UTF8 is the way forward.
Um, now you've confused me.
The Oracle docs say "In AL32UTF8, one supplementary character is represented in one code point, totalling four bytes.", which is what you call the "correct UTF-8 way". So the old Oracle ``UTF8'' charset is what's now called "CESU-8", and what Oracle calls ``AL32UTF8'' is the "correct UTF-8 way".
> So did you mean CESU-8 when you said AL32UTF8?
I guess so.
Thank you for reminding me of this. I used to know that, but forgot it and was about to tell my colleague to use 'UTF8' (instead of 'AL32UTF8') when she creates a database with Oracle for our project.
Oracle is notorious for using 'incorrect' and confusing character encoding names. Their 'AL32UTF8' is the true and only UTF-8, while __their__ 'UTF8' is CESU-8 (a beast that MUST be confined within Oracle and MUST NOT be leaked out to the world at large. Needless to say, it'd be even better had it never been born).
Oracle has no excuse whatsoever for failing to get their 'UTF8' right in the first place, because Unicode had been extended beyond the BMP long before they introduced UTF8 into their product(s) (let alone the fact that ISO 10646 had non-BMP planes from the very beginning in the 1980s and that UTF-8 was devised to cover the full set of ISO 10646). However, they failed, and in their 'UTF8' a single character beyond the BMP was (and still is) encoded as a pair of 3-byte representations of surrogate code points, 6 bytes in total. Apparently for the sake of backward compatibility (I wonder how many Oracle databases existed with non-BMP characters stored in their 'UTF8' when they decided to follow this route), they decided to keep the designation 'UTF8' for CESU-8 and came up with a new designation, 'AL32UTF8', for the true and only UTF-8.
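
For anyone who wants to see the difference in bytes, here is a rough Python sketch (not Oracle's code, just an illustration). The helper name cesu8_encode is my own invention, and the two sample characters (U+10400 and U+FF61) are arbitrary picks. It encodes a supplementary character both ways and shows how a plain binary sort comes out differently for the two encodings.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (hypothetical helper, for illustration only).

    BMP characters are encoded exactly as in UTF-8. A supplementary
    character is first split into a UTF-16 surrogate pair, and each
    surrogate is then written as its own 3-byte sequence.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)    # high (lead) surrogate
            low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
            for surrogate in (high, low):
                # "surrogatepass" lets the UTF-8 codec emit the 3-byte
                # form of a surrogate code point, which strict UTF-8 forbids.
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


supplementary = "\U00010400"   # a character beyond the BMP
bmp_high = "\uff61"            # a BMP character above the surrogate range

print(supplementary.encode("utf-8").hex(" "))   # true UTF-8: 4 bytes
print(cesu8_encode(supplementary).hex(" "))     # CESU-8: 2 x 3 bytes

# Binary sort order differs between the two encodings.
words = [bmp_high, supplementary]
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))
by_cesu8 = sorted(words, key=cesu8_encode)
print(by_utf8 == by_cesu8)
```

In true UTF-8 the supplementary character starts with byte F0 and so sorts after every BMP character, which matches code point order. In CESU-8 it starts with ED (the surrogate range) and so sorts before BMP characters in U+E000..U+FFFF, which matches UTF-16 code unit order instead.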