RE: Encoding conversions

Carl W. Brown Sun, 09 Sep 2001 20:38:44 -0700
Jungshik Shin,

Currently xIUA supports the following:

enum XIUA_InvalidChar {
    XICSKIP                  =  0,  /* Skip bad characters */
    XICSUBSTITUTE            =  1,  /* Use substitute character */
    XICSTOP                  =  2,  /* Stop the conversion */
    XICFROM_U_ESCAPE_ICU     =  3,  /* Use ICU escape Unicode to Char */
    XICTO_U_ESCAPE_ICU       =  4,  /* Use ICU escape Char to Unicode */
    XICFROM_U_ESCAPE_JAVA    =  5,  /* Use JAVA escape Unicode to Char */
    XICTO_U_ESCAPE_JAVA      =  6,  /* Use JAVA escape Char to Unicode */
    XICFROM_U_ESCAPE_C       =  7,  /* Use C escape Unicode to Char */
    XICTO_U_ESCAPE_C         =  8,  /* Use C escape Char to Unicode */
    XICFROM_U_ESCAPE_XML_DEC =  9,  /* Use HTML/XML decimal escape Unicode to Char */
    XICTO_U_ESCAPE_XML_DEC   = 10,  /* Use HTML/XML decimal escape Char to Unicode */
    XICFROM_U_ESCAPE_XML_HEX = 11,  /* Use HTML/XML hex escape Unicode to Char */
    XICTO_U_ESCAPE_XML_HEX   = 12,  /* Use HTML/XML hex escape Char to Unicode */
    XICNO_FALLBACK           = 13,  /* Do not use fallback conversion characters */
    XICUSE_FALLBACK          = 14   /* Use fallback conversion characters */
};

The ICU escape is a special sequence defined by ICU itself.

Java is still UCS-2 oriented, so it uses \uxxxx.

C escapes to \Uxxxxxxxx if the code point is > U+FFFF and to \uxxxx if it
is <= U+FFFF.
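
To make the escape forms concrete, here is a minimal sketch of the
Unicode-to-Char direction.  It is not xIUA or ICU code: the names
escape_style and escape_code_point are made up for illustration, and the
Java case assumes that code points above U+FFFF are written as a surrogate
pair of two \uxxxx escapes.  The ICU form is left out.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical style selector; not part of xIUA or ICU. */
enum escape_style { ESC_JAVA, ESC_C, ESC_XML_DEC, ESC_XML_HEX };

/*
 * Write the escape sequence for code point cp into buf.
 * Returns the length reported by snprintf, or -1 if cp is not a
 * Unicode scalar value (out of range or a surrogate).
 */
int escape_code_point(unsigned long cp, enum escape_style style,
                      char *buf, size_t size)
{
    if (cp > 0x10FFFFUL || (cp >= 0xD800UL && cp <= 0xDFFFUL))
        return -1;

    switch (style) {
    case ESC_JAVA:
        if (cp > 0xFFFFUL) {
            /* Assumption: supplementary characters become a surrogate pair. */
            unsigned long v = cp - 0x10000UL;
            return snprintf(buf, size, "\\u%04lX\\u%04lX",
                            0xD800UL + (v >> 10), 0xDC00UL + (v & 0x3FFUL));
        }
        return snprintf(buf, size, "\\u%04lX", cp);
    case ESC_C:
        /* Eight hex digits above U+FFFF, four otherwise. */
        if (cp > 0xFFFFUL)
            return snprintf(buf, size, "\\U%08lX", cp);
        return snprintf(buf, size, "\\u%04lX", cp);
    case ESC_XML_DEC:
        return snprintf(buf, size, "&#%lu;", cp);
    case ESC_XML_HEX:
        return snprintf(buf, size, "&#x%lX;", cp);
    }
    return -1;
}
```

With this sketch, U+2212 in the C style comes out as \u2212 and in the XML
decimal style as &#8722;.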

XICNO_FALLBACK ensures that the mapping is reversible and will not
convert alternate encodings.  XICUSE_FALLBACK allows similar characters to
convert (xIUA default).  For example, U+2212 and U+FF0D both map to the
double-width '-'.  You have to pick one going the other way.
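
A toy sketch of why the fallback setting matters for reversibility.  This
is not how xIUA or ICU store their tables.  The struct, the function
names, the code value 0x817C and the choice of U+FF0D as the round-trip
partner are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical table entry: one Unicode code point, one target code. */
struct map_entry {
    unsigned long cp;        /* Unicode code point */
    unsigned int  code;      /* code in the target encoding (stand-in value) */
    int           roundtrip; /* 1 = reversible mapping, 0 = fallback only */
};

/* Two code points sharing one target code; only one can round-trip. */
static const struct map_entry toy_table[] = {
    { 0xFF0DUL, 0x817Cu, 1 },  /* assumed round-trip mapping */
    { 0x2212UL, 0x817Cu, 0 },  /* assumed fallback mapping */
};

/* Unicode to Char. Fallback entries are used only if use_fallback is set. */
int from_unicode(unsigned long cp, int use_fallback, unsigned int *code)
{
    size_t i;
    for (i = 0; i < sizeof toy_table / sizeof toy_table[0]; i++) {
        if (toy_table[i].cp == cp &&
            (toy_table[i].roundtrip || use_fallback)) {
            *code = toy_table[i].code;
            return 1;
        }
    }
    return 0;  /* unconvertible under the current setting */
}

/* Char to Unicode. Only the round-trip entry is ever chosen. */
int to_unicode(unsigned int code, unsigned long *cp)
{
    size_t i;
    for (i = 0; i < sizeof toy_table / sizeof toy_table[0]; i++) {
        if (toy_table[i].code == code && toy_table[i].roundtrip) {
            *cp = toy_table[i].cp;
            return 1;
        }
    }
    return 0;
}
```

With fallback enabled, U+2212 converts but comes back as U+FF0D.  With
fallback disabled, U+2212 is reported as unconvertible, so whatever does
convert also converts back unchanged.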

The problem with 'U+hhhh[hh]' is that you cannot tell whether U+123456 is a
single Unicode code point or U+1234 followed by "56".
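
A small sketch of that ambiguity, using a hypothetical helper parse_uplus
that is not taken from iconv, ICU or xIUA.  It reads "U+" followed by at
least four and at most max_digits hex digits.  The same input gives two
different answers depending on how many digits the reader is willing to
take, and nothing in the notation itself says which one was meant.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Parse "U+" followed by 4..max_digits hex digits, taking as many
 * digits as allowed (greedy).  Returns the number of characters
 * consumed and stores the value in *cp, or returns 0 on no match.
 */
size_t parse_uplus(const char *s, int max_digits, unsigned long *cp)
{
    unsigned long value = 0;
    int n = 0;

    if (s[0] != 'U' || s[1] != '+')
        return 0;

    while (n < max_digits && isxdigit((unsigned char)s[2 + n])) {
        int c = (unsigned char)s[2 + n];
        int d = isdigit(c) ? c - '0' : toupper(c) - 'A' + 10;
        value = value * 16 + (unsigned long)d;
        n++;
    }
    if (n < 4)
        return 0;

    *cp = value;
    return (size_t)n + 2;
}
```

For the input "U+1D11E", a reader limited to four digits sees U+1D11
followed by the letter "E", while a reader allowing six digits sees the
single code point U+1D11E.  A terminated form such as the XML escape
avoids the problem because the ';' marks the end.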

Carl


> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Jungshik Shin
> Sent: Sunday, September 09, 2001 8:25 PM
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: RE: Encoding conversions
>
>
> On Sun, 9 Sep 2001, Bruno Haible wrote:
> > Carl W. Brown writes:
> >
> > > ICU has an invalid character callback handler.  I use it for example to
> > > convert characters that are not in the code page to HTML/XML escape
> > > sequences.
> >
> > You can do that with iconv() as well. With iconv(), the processing
> > simply stops at an invalid/unconvertible character, and the programmer
> > can do any kind of error handling before restarting the conversion.
>
>   Perhaps it might be nice to extend iconv(1) (not a C lib.
> function iconv(3) but a cmd line tool iconv(1) ) to add a couple of
> options as to how to deal with chars not directly representable in the
> target encoding. Needless to say, the default behavior should be as it
> is now.
>
>   --xml : represent chars not in the target encoding/codeset
>           with XML NCRs
>   --ucv : represent chars not in the target encoding/codeset
>           with Unicode Scalar Value in the format of 'U+hhhh[hh]'
>   --ignore_invalid : just skip over invalid characters instead of stopping
>                      at them
>
>   Jungshik Shin
>
> -
> Linux-UTF8:   i18n of Linux on all levels
> Archive:      http://mail.nl.linux.org/linux-utf8/
