The popular Expat XML parser has no built-in support for character sets other than UTF-8, UTF-16, ISO-8859-1, and ASCII. It does, however, provide a mechanism through which many other character sets can be supported: a handler can be registered that gets called whenever an unknown encoding is encountered, and that handler must populate a structure containing a byte map and a converter function.
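For concreteness, the registration side looks something like this (a minimal sketch; unknown_encoding_handler() is the part that has to fill in the structure, and filling it in correctly is what the rest of this mail is about):

    #include <expat.h>

    /* Called whenever the parser meets an encoding it does not know.
       Fill in 'info' and return non-zero if we can supply the encoding,
       or 0 to make the parse fail with an unknown-encoding error. */
    static int
    unknown_encoding_handler(void *data, const XML_Char *name,
                             XML_Encoding *info)
    {
        return 0;   /* stub: see the EUC-JP sketch in the P.S. below */
    }

    int
    main(void)
    {
        XML_Parser parser = XML_ParserCreate(NULL);
        XML_SetUnknownEncodingHandler(parser, unknown_encoding_handler, NULL);
        /* ... feed the document to XML_Parse() ... */
        XML_ParserFree(parser);
        return 0;
    }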
From the Expat documentation:

    Expat places restrictions on character encodings that it can
    support by filling in the XML_Encoding structure.

    1. Every ASCII character that can appear in a well-formed XML
       document must be represented by a single byte, and that byte
       must correspond to its ASCII encoding (except for the
       characters $@\^`{}~).

    2. Characters must be encoded in at most four bytes.

    3. All characters encoded must have Unicode scalar values less
       than or equal to 65535 (0xFFFF). (This does not apply to the
       built-in support for UTF-16 and UTF-8.)

    4. No character may be encoded by more than one distinct sequence
       of bytes.

The XML_Encoding structure:

    typedef struct {
      int map[256];
      void *data;
      int (*convert)(void *data, const char *s);
      void (*release)(void *data);
    } XML_Encoding;

contains an array of integers indexed by the first byte of an encoding sequence. If the value in the array for a byte is zero or positive, then the byte is a single-byte encoding and the value is the Unicode scalar value it encodes. A -1 in this array indicates a malformed byte. If the value is -2, -3, or -4, then the byte is the beginning of a 2-, 3-, or 4-byte sequence, respectively. Multi-byte sequences are sent to the convert function pointed to by the XML_Encoding structure. This function should return the Unicode scalar value for the sequence, or -1 if the sequence is malformed.

One pitfall that novice Expat users are likely to fall into is that although Expat may accept input in various encodings, the strings that it passes to the handlers are always encoded in UTF-8 or UTF-16 (depending on how Expat was compiled). Your application is responsible for any translation of these strings into other encodings.

For example, for EUC-JP I think I would have to populate the map with the ASCII characters, mark the lead bytes in the 0x80 to 0xFF range as multi-byte (strictly, -2 for 0x8E and 0xA1-0xFE, and -3 for 0x8F, which introduces the three-byte JIS X 0212 sequences), and, assuming the platform is __STDC_ISO_10646__, use a wrapper convert function around the euc_jp_mbtowc function from libiconv; see the P.S. below for a rough sketch.

My question is: can I create such a handler, building against the libiconv sources, that does not require semantic information about each encoding? Is there a way to determine how many bytes are needed to represent each character in a character set? Can I dynamically generate this information with Markus Kuhn's Perl tools, or by some other means? Otherwise I definitely do not have the required knowledge to implement this and be certain it is correct for all encodings.

Any ideas?

Thanks,
Mike

-- 
A program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes the potential for it to be applied to tasks that are
conceptually similar and, more important, to tasks that have not
yet been conceived.
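P.S. To make the EUC-JP idea concrete, here is roughly how the skeleton above might be filled in. This is only a sketch under several assumptions: euc_jp_mbtowc() and ucs4_t are internal to the libiconv source tree (lib/euc_jp.h), not part of its public API; EUC-JP is stateless, so I believe the conv argument can be NULL; and it hard-codes exactly the per-encoding knowledge I would like to generate automatically.

    #include <string.h>
    #include <expat.h>
    /* ucs4_t and euc_jp_mbtowc() come from the libiconv source tree
       (lib/euc_jp.h); they are internal, not public libiconv API. */

    static int
    euc_jp_convert(void *data, const char *s)
    {
        ucs4_t wc;
        /* The map below tells Expat how long each sequence is, and
           euc_jp_mbtowc() reads only what the lead byte calls for;
           3 is the longest EUC-JP form. */
        int n = euc_jp_mbtowc(NULL, &wc, (const unsigned char *)s, 3);
        if (n <= 0 || wc > 0xFFFF)  /* malformed, or beyond Expat's range */
            return -1;
        return (int)wc;
    }

    static int
    unknown_encoding_handler(void *data, const XML_Char *name,
                             XML_Encoding *info)
    {
        int i;
        /* A real handler would match names case-insensitively and
           cope with aliases. */
        if (strcmp(name, "EUC-JP") != 0)
            return 0;                      /* not an encoding we handle */
        for (i = 0; i < 0x80; i++)
            info->map[i] = i;              /* ASCII maps to itself */
        for (i = 0x80; i < 0x100; i++)
            info->map[i] = -1;             /* malformed by default */
        info->map[0x8E] = -2;              /* SS2: JIS X 0201 kana, 2 bytes */
        info->map[0x8F] = -3;              /* SS3: JIS X 0212, 3 bytes */
        for (i = 0xA1; i <= 0xFE; i++)
            info->map[i] = -2;             /* JIS X 0208 lead byte, 2 bytes */
        info->data = NULL;
        info->convert = euc_jp_convert;
        info->release = NULL;              /* nothing to free */
        return 1;                          /* non-zero: encoding handled */
    }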