RE: Encoding conversions

Carl W. Brown Sun, 09 Sep 2001 19:22:33 -0700
Peter,

>
> Bull.  Oracle f*cked up, multiple times (including sorting UTF-16 in
> UTF-16 binary order instead of Unicode order) and are trying to
> justify it.

They did it because clients like PeopleSoft and SAP use UCS-2 and do binary
searches.  Code point order comparisons on UTF-16 data are simple and do not
add much overhead.

/* UTF-16 Unicode sort order table */
static const UChar utf16Fixup[32]={
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0x2000, 0xf800, 0xf800, 0xf800, 0xf800
};

int32_t xiu2_strncmp(const UChar *str1,
                     const UChar *str2,
                     int32_t length)
{
    int32_t c1, c2;
    int32_t diff;

    if(length > 0) {
        /* rotate each code unit's value so that surrogates get the
           highest values */
        for(;;) {
            /* "fix-up" lines: the mask keeps the sum in 16 bits so that
               E000..FFFF wraps down to D800..F7FF, below the surrogates */
            c1=*str1;
            c1=(c1+utf16Fixup[c1>>11])&0xffff;
            c2=*str2;
            c2=(c2+utf16Fixup[c2>>11])&0xffff;

            /* now c1 and c2 are in UTF-32-compatible order */
            diff=c1-c2;
            if(diff!=0 || c1==0 || --length == 0) {
                return diff;
            }
            ++str1;
            ++str2;
        }
    } else {
        return 0;
    }
}
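
A minimal sketch of what the fix-up changes, written with explicit range tests
instead of a table so the mapping is easy to read.  The names fixup,
cmp_binary and cmp_codepoint are made up for this illustration and are not
part of the function above; uint16_t stands in for UChar.

```c
#include <stdint.h>

/* Hypothetical helper: remap one UTF-16 code unit so that surrogates
   (D800..DFFF) sort above E000..FFFF, as they do in code point order. */
static uint16_t fixup(uint16_t c)
{
    if (c >= 0xE000) {
        return (uint16_t)(c - 0x0800);  /* E000..FFFF -> D800..F7FF */
    }
    if (c >= 0xD800) {
        return (uint16_t)(c + 0x2000);  /* D800..DFFF -> F800..FFFF */
    }
    return c;                           /* 0000..D7FF unchanged */
}

/* Plain binary comparison of zero-terminated UTF-16 strings. */
static int cmp_binary(const uint16_t *a, const uint16_t *b)
{
    while (*a != 0 && *a == *b) {
        ++a;
        ++b;
    }
    return (int)*a - (int)*b;
}

/* Same loop, but the first differing code units are compared after
   the fix-up, which gives code point order. */
static int cmp_codepoint(const uint16_t *a, const uint16_t *b)
{
    while (*a != 0 && *a == *b) {
        ++a;
        ++b;
    }
    return (int)fixup(*a) - (int)fixup(*b);
}
```

For U+FF5E (one unit, FF5E) against U+10000 (two units, D800 DC00),
cmp_binary puts U+FF5E last because FF5E > D800, while cmp_codepoint puts
U+10000 last because the lead surrogate is remapped to F800 and FF5E to F75E.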

They listened to people who did not understand the issues and thought that
any program supporting UCS-2 would support UTF-16.  Now they are stuck with
no clean migration.

>
> This is yet another example on why UTF-16 is such an enormous
> screwup...
>

There is history.  I have been using Unicode for over 10 years.  Getting
people to buy off on 16-bit characters with UCS-2 was a major selling
obstacle.  Resources were much more costly then.  Nobody thought that
Unicode would ever succeed.  If Unicode were starting out today, there
probably would not be a 16-bit encoding.

I really don't mind UTF-16, but what burns me up are the 7-bit fans.  What I
hate are code pages like iso-2022.  I looked into it and there is no
reasonable way to do any kind of string manipulation on stateful code pages
like that.  My code checks for stateful code pages and just returns an error
if you try any string manipulation function.
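
A rough sketch of that kind of guard, not the actual code.  Everything here
is an assumption made for illustration: the error codes, the function names
is_stateful_encoding and checked_strlen, the short prefix list (which is not
exhaustive), and the requirement that encoding names arrive in upper case.

```c
#include <string.h>

/* Hypothetical result codes for this sketch. */
enum { STR_OK = 0, STR_ERR_STATEFUL = -1 };

/* Returns 1 if the encoding name starts with a prefix known to denote a
   stateful (escape-sequence driven) encoding, 0 otherwise.
   Assumption: names are already upper case; the list is illustrative. */
static int is_stateful_encoding(const char *name)
{
    static const char *const prefixes[] = { "ISO-2022-", "HZ-GB-2312" };
    size_t i;

    for (i = 0; i < sizeof prefixes / sizeof prefixes[0]; ++i) {
        if (strncmp(name, prefixes[i], strlen(prefixes[i])) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Example of a string function that refuses stateful code pages.
   For other encodings it reports the byte length of s. */
static int checked_strlen(const char *encoding, const char *s, size_t *out_len)
{
    if (is_stateful_encoding(encoding)) {
        return STR_ERR_STATEFUL;
    }
    *out_len = strlen(s);
    return STR_OK;
}
```

With this sketch, checked_strlen("ISO-2022-JP", ...) returns STR_ERR_STATEFUL
without touching the string, while "ISO-8859-1" goes through normally.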

Carl

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/