Michael B. Allen writes:
> The popular Expat XML parser does not have built in support for handling
> character sets other than UTF-8, UTF-16, ISO-8859-1, and ASCII.

This is usually sufficient. I've never seen an XML file in anything
other than ISO-8859-1 or UTF-8.

> For example, for EUC-JP, I think I would have to populate the map with
> the ASCII character set, put -2 in the 80 to FF range,

This is not correct. EUC-JP also has 3-byte sequences.

   0x80..0x8D -> -1
   0x8E       -> -2
   0x8F       -> -3
   0xA1..0xFE -> -2
   0xFF       -> -1
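As a sketch, filling Expat's map per the table above could look like
this (a plain int[256] here to stay self-contained; in real code this
would be info->map inside an XML_UnknownEncodingHandler, and the
0x90..0xA0 range, not listed in the table, is also invalid in EUC-JP):

```c
/* Fill an Expat XML_Encoding-style map for EUC-JP.
   Positive entries are the Unicode value of a single byte;
   -n means "start of an n-byte sequence"; -1 means invalid. */
void fill_euc_jp_map(int map[256])
{
    for (int i = 0; i < 0x80; i++)
        map[i] = i;                      /* ASCII maps to itself */
    for (int i = 0x80; i <= 0x8D; i++)
        map[i] = -1;                     /* invalid lead bytes */
    map[0x8E] = -2;                      /* SS2: 2-byte halfwidth katakana */
    map[0x8F] = -3;                      /* SS3: 3-byte JIS X 0212 */
    for (int i = 0x90; i <= 0xA0; i++)
        map[i] = -1;                     /* invalid (not in table above) */
    for (int i = 0xA1; i <= 0xFE; i++)
        map[i] = -2;                     /* 2-byte JIS X 0208 */
    map[0xFF] = -1;                      /* invalid */
}
```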

> and assuming the platform is __STDC_ISO_10646__ I would use a
> wrapper convert function to the euc_jp_mbtowc function from
> libiconv.

You don't need __STDC_ISO_10646__, because although the function is
called euc_jp_mbtowc, it doesn't use the wchar_t type. Also, consider
using the iconv() function itself, so that on Linux you can use the
one in glibc.
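For illustration, here is a minimal wrapper around iconv() that turns
one EUC-JP byte sequence into a Unicode code point, which is the job
of Expat's convert callback (euc_jp_to_ucs is a hypothetical name;
assumes iconv knows "EUC-JP" and "UTF-32LE", as glibc and libiconv do):

```c
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Convert the n-byte EUC-JP sequence s to a Unicode code point,
   or return -1 on failure. */
long euc_jp_to_ucs(const char *s, size_t n)
{
    iconv_t cd = iconv_open("UTF-32LE", "EUC-JP");
    if (cd == (iconv_t)-1)
        return -1;
    uint32_t out = 0;
    char *in = (char *)s;
    char *outp = (char *)&out;
    size_t inleft = n, outleft = sizeof out;
    size_t r = iconv(cd, &in, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (r == (size_t)-1)
        return -1;
    return (long)out;
}
```

In a real handler you would open the iconv_t once, stash it in the
XML_Encoding's data pointer, and free it in the release callback.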

> My question is, can I create such a handler that builds against the
> libiconv sources that does not require semantic information about each
> encoding?

Yes, you can mechanically extract the needed information by calling
iconv() once for every possible multibyte sequence.
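One way to sketch this: feed iconv() progressively longer candidate
sequences for each lead byte and distinguish "incomplete" (EINVAL)
from "invalid" (EILSEQ). This assumes glibc/libiconv errno behavior,
and 0xA1 as a plausible trail byte, so it is a probe, not a proof:

```c
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Probe the sequence length 1..4 started by lead byte b in encoding
   enc, or return -1 if b cannot start a valid sequence. */
int seq_len(const char *enc, unsigned char b)
{
    iconv_t cd = iconv_open("UTF-32LE", enc);
    if (cd == (iconv_t)-1)
        return -1;
    int result = -1;
    for (int len = 1; len <= 4; len++) {
        char buf[4];
        memset(buf, 0xA1, sizeof buf);   /* guessed trail bytes */
        buf[0] = (char)b;
        char out[8];
        char *in = buf, *outp = out;
        size_t inleft = (size_t)len, outleft = sizeof out;
        iconv(cd, NULL, NULL, NULL, NULL);  /* reset shift state */
        if (iconv(cd, &in, &inleft, &outp, &outleft) != (size_t)-1) {
            result = len;                /* converted: len-byte sequence */
            break;
        }
        if (errno == EILSEQ) {
            result = -1;                 /* invalid sequence */
            break;
        }
        /* errno == EINVAL: incomplete, try one byte more */
    }
    iconv_close(cd);
    return result;
}
```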

> Is there a way to determine how many bytes will be needed to
> represent each character in a character set?

Yes, just take a look at the conversion tables, e.g. in
libiconv/tests/*.TXT.

> Can I dynamically generate this information with Markus Kuhn's perl
> tools or by some other means?

If you want it to be slow, you can certainly use perl for that
purpose.

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
