Jungshik Shin,
Currently xIUA supports the following:
enum XIUA_InvalidChar {
    XICSKIP                  = 0,  /* Skip bad characters */
    XICSUBSTITUTE            = 1,  /* Use substitute character */
    XICSTOP                  = 2,  /* Stop the conversion */
    XICFROM_U_ESCAPE_ICU     = 3,  /* Use ICU escape, Unicode to Char */
    XICTO_U_ESCAPE_ICU       = 4,  /* Use ICU escape, Char to Unicode */
    XICFROM_U_ESCAPE_JAVA    = 5,  /* Use JAVA escape, Unicode to Char */
    XICTO_U_ESCAPE_JAVA      = 6,  /* Use JAVA escape, Char to Unicode */
    XICFROM_U_ESCAPE_C       = 7,  /* Use C escape, Unicode to Char */
    XICTO_U_ESCAPE_C         = 8,  /* Use C escape, Char to Unicode */
    XICFROM_U_ESCAPE_XML_DEC = 9,  /* Use HTML/XML decimal escape, Unicode to Char */
    XICTO_U_ESCAPE_XML_DEC   = 10, /* Use HTML/XML decimal escape, Char to Unicode */
    XICFROM_U_ESCAPE_XML_HEX = 11, /* Use HTML/XML hex escape, Unicode to Char */
    XICTO_U_ESCAPE_XML_HEX   = 12, /* Use HTML/XML hex escape, Char to Unicode */
    XICNO_FALLBACK           = 13, /* Do not use fallback conversion characters */
    XICUSE_FALLBACK          = 14, /* Use fallback conversion characters */
};
The ICU escape is a special sequence of ICU's own.
Java is still UCS-2 oriented, so it uses \uxxxx.
C escapes to \Uxxxxxxxx if > U+FFFF and to \uxxxx if <= U+FFFF.
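
To make the escape shapes concrete, here is a rough sketch of a formatter
for the Java, C and XML styles. It is not xIUA or ICU code: the names
esc_style and format_escape are made up for illustration, and writing a
code point above U+FFFF as a surrogate pair in the Java style is an
assumption that follows from Java being 16-bit oriented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical style names for this sketch; they are not xIUA identifiers. */
enum esc_style { ESC_JAVA, ESC_C, ESC_XML_DEC, ESC_XML_HEX };

/* Write the escape for code point cp into buf (capacity n).
   Returns what snprintf returns: the length the escape needs. */
int format_escape(enum esc_style style, unsigned long cp, char *buf, size_t n)
{
    assert(buf != NULL && cp <= 0x10FFFFUL);

    switch (style) {
    case ESC_JAVA:
        if (cp > 0xFFFFUL) {
            /* Assumption: supplementary code points become a UTF-16
               surrogate pair, each half escaped on its own. */
            unsigned long v = cp - 0x10000UL;
            return snprintf(buf, n, "\\u%04lX\\u%04lX",
                            0xD800UL + (v >> 10), 0xDC00UL + (v & 0x3FFUL));
        }
        return snprintf(buf, n, "\\u%04lX", cp);
    case ESC_C:
        /* Eight hex digits above U+FFFF, four otherwise. */
        if (cp > 0xFFFFUL)
            return snprintf(buf, n, "\\U%08lX", cp);
        return snprintf(buf, n, "\\u%04lX", cp);
    case ESC_XML_DEC:
        return snprintf(buf, n, "&#%lu;", cp);   /* decimal NCR */
    case ESC_XML_HEX:
        return snprintf(buf, n, "&#x%lX;", cp);  /* hex NCR */
    }
    return -1;
}
```

With a 32 byte buffer, format_escape(ESC_XML_DEC, 0x20AC, buf, sizeof buf)
leaves "&#8364;" in buf.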
XICNO_FALLBACK ensures that the mapping is reversible and will not
convert alternate encodings. XICUSE_FALLBACK allows similar characters to
convert (the xIUA default). For example, U+2212 and U+FF0D both map to the
double wide '-'. You have to pick one going the other way.
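
A toy table shows the idea. This is only a sketch, not the xIUA or ICU
mapping data: WIDE_MINUS is a made-up placeholder code, the function names
are hypothetical, and which of the two code points is treated as the
round-trip entry is an assumption here (real code pages differ on this).

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder target code for the double wide '-'; the value is made up. */
#define WIDE_MINUS 0x01u

struct map_entry {
    unsigned long ucs;      /* Unicode code point */
    unsigned int  target;   /* code in the target code page */
    int           fallback; /* 1 = one-way entry, Unicode to code page only */
};

/* Assumption: U+FF0D is the round-trip entry, U+2212 the fallback. */
static const struct map_entry table[] = {
    { 0xFF0DUL, WIDE_MINUS, 0 },
    { 0x2212UL, WIDE_MINUS, 1 },
};

#define TABLE_LEN (sizeof table / sizeof table[0])

/* Unicode to code page. Returns 1 on success, 0 if there is no mapping
   or the only mapping is a fallback and use_fallback is 0. */
int from_unicode(unsigned long ucs, int use_fallback, unsigned int *out)
{
    size_t i;
    assert(out != NULL);
    for (i = 0; i < TABLE_LEN; i++) {
        if (table[i].ucs == ucs && (use_fallback || !table[i].fallback)) {
            *out = table[i].target;
            return 1;
        }
    }
    return 0;
}

/* Code page to Unicode. Only round-trip entries are candidates, so the
   answer is unique. Returns 1 on success, 0 if unmapped. */
int to_unicode(unsigned int target, unsigned long *out)
{
    size_t i;
    assert(out != NULL);
    for (i = 0; i < TABLE_LEN; i++) {
        if (table[i].target == target && !table[i].fallback) {
            *out = table[i].ucs;
            return 1;
        }
    }
    return 0;
}
```

With fallbacks on, U+2212 converts but comes back as U+FF0D, so the round
trip is lossy. With fallbacks off, U+2212 is reported as unconvertible and
the bad character handling above takes over.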
The problem with 'U+hhhh[hh]' is that you cannot tell whether U+123456 is a
single Unicode code point or U+1234 followed by "56".
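
A small sketch of a reader for that format shows the ambiguity. The helper
parse_uplus is hypothetical, not part of iconv or xIUA. The same input
decodes differently depending on how many hex digits the reader is willing
to take, and nothing in the text says which reading was meant.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: read "U+" followed by at least 4 and at most
   max_digits hex digits. Returns the number of characters consumed,
   or 0 if s does not start with such an escape. */
size_t parse_uplus(const char *s, int max_digits, unsigned long *cp)
{
    size_t i = 2;
    int digits = 0;
    unsigned long v = 0;

    assert(s != NULL && cp != NULL);
    if (s[0] != 'U' || s[1] != '+')
        return 0;

    /* Take hex digits greedily, up to the caller's limit. */
    while (digits < max_digits && isxdigit((unsigned char)s[i])) {
        int c = tolower((unsigned char)s[i]);
        v = v * 16 + (unsigned long)(isdigit(c) ? c - '0' : c - 'a' + 10);
        i++;
        digits++;
    }
    if (digits < 4)
        return 0;

    *cp = v;
    return i;
}
```

Called on "U+123456" with max_digits 4 it consumes 6 characters and yields
0x1234, leaving "56" as ordinary text. With max_digits 6 it consumes all 8
characters and yields 0x123456. The XML form does not have this problem
because the ';' terminates the escape.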
Carl
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Jungshik Shin
> Sent: Sunday, September 09, 2001 8:25 PM
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: RE: Encoding conversions
>
>
> On Sun, 9 Sep 2001, Bruno Haible wrote:
> > Carl W. Brown writes:
> >
> > > ICU has an invalid character callback handler. I use it for example to
> > > convert characters that are not in the code page to HTML/XML escape
> > > sequences.
> >
> > You can do that with iconv() as well. With iconv(), the processing
> > simply stops at an invalid/unconvertible character, and the programmer
> > can do any kind of error handling before restarting the conversion.
>
> Perhaps it might be nice to extend iconv(1) (not a C lib.
> function iconv(3) but a cmd line tool iconv(1) ) to add a couple of
> options as to how to deal with chars not directly representable in the
> target encoding. Needless to say, the default behavior should be as it
> is now.
>
> --xml : represent chars not in the target encoding/codeset
> with XML NCRs
> --ucv : represent chars not in the target encoding/codeset
> with Unicode Scalar Value in the format of 'U+hhhh[hh]'
> --ignore_invalid : just skip over invalid characters instead of stopping
> at them
>
> Jungshik Shin
>
> -
> Linux-UTF8: i18n of Linux on all levels
> Archive: http://mail.nl.linux.org/linux-utf8/