Michael B. Allen writes:

> The popular Expat XML parser does not have built in support for
> handling character sets other than UTF-8, UTF-16, ISO-8859-1, and
> ASCII.
This is usually sufficient. I've never seen an XML file in anything
other than ISO-8859-1 or UTF-8.

> For example, for EUC-JP, I think I would have to populate the map
> with the ASCII character set, put -2 in the 80 to FF range,

This is not correct. EUC-JP also has 3-byte sequences:

  0x80..0x8D -> -1
  0x8E       -> -2
  0x8F       -> -3
  0xA1..0xFE -> -2
  0xFF       -> -1

> and assuming the platform is __STDC_ISO_10646__ I would use a
> wrapper convert function to the euc_jp_mbtowc function from
> libiconv.

You don't need __STDC_ISO_10646__: although the function is called
euc_jp_mbtowc, it doesn't use the wchar_t type. Also, consider using
the iconv() function itself, so that on Linux you can use the one in
glibc.

> My question is, can I create such a handler that builds against the
> libiconv sources that does not require semantic information about
> each encoding?

Yes, you can mechanically extract the needed information by calling
iconv() once for every possible input sequence.

> Is there a way to determine how many bytes will be needed to
> represent each character in a character set?

Yes, just take a look at the conversion tables, e.g. in
libiconv/tests/*.TXT.

> Can I dynamically generate this information with Markus Kuhn's perl
> tools or by some other means?

If you want it to be slow, you can certainly use perl for that
purpose.

Bruno

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
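[Editor's note: the byte ranges above correspond to the values one would put into the `map` array of Expat's `XML_Encoding` structure, where `map[b] = b` means a single-byte character, `-2`/`-3` mark the lead byte of a 2- or 3-byte sequence, and `-1` marks an invalid byte. The following standalone sketch fills such a 256-entry map from those ranges; the function name is illustrative, and the range 0x90..0xA0, which the table above does not cover, is marked invalid here as an assumption.]

```c
/* Fill an Expat-style 256-entry encoding map for EUC-JP, following
 * the byte ranges given above:
 *   map[b] = b  -> single-byte character (ASCII maps to itself)
 *   map[b] = -2 -> lead byte of a 2-byte sequence
 *   map[b] = -3 -> lead byte of a 3-byte sequence
 *   map[b] = -1 -> invalid byte
 */
static void fill_euc_jp_map(int map[256])
{
    int i;
    for (i = 0x00; i <= 0x7F; i++) map[i] = i;    /* ASCII */
    for (i = 0x80; i <= 0x8D; i++) map[i] = -1;   /* invalid */
    map[0x8E] = -2;   /* SS2: 2-byte half-width katakana */
    map[0x8F] = -3;   /* SS3: 3-byte JIS X 0212 sequence */
    for (i = 0x90; i <= 0xA0; i++) map[i] = -1;   /* assumed invalid */
    for (i = 0xA1; i <= 0xFE; i++) map[i] = -2;   /* 2-byte JIS X 0208 */
    map[0xFF] = -1;   /* invalid */
}
```

A real handler registered with XML_SetUnknownEncodingHandler would additionally supply a `convert` callback for the multi-byte sequences, which is where a libiconv-based decoder would plug in.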