The popular Expat XML parser does not have built-in support for handling
character sets other than UTF-8, UTF-16, ISO-8859-1, and ASCII. However,
it does provide a mechanism through which many other character sets may
be supported: a handler can be registered that gets called when an
unknown encoding is encountered, and this handler must then populate a
structure containing a map and a converter function. From the Expat
documentation:
Expat places restrictions on character encodings that it can support by
filling in the XML_Encoding structure.
1. Every ASCII character that can appear in a well-formed XML document
must be represented by a single byte, and that byte must correspond to
its ASCII encoding (except for the characters $@\^`{}~)
2. Characters must be encoded in 4 bytes or less.
3. All characters encoded must have Unicode scalar values less than or
equal to 65535 (0xFFFF). This does not apply to the built-in support for
UTF-16 and UTF-8.
4. No character may be encoded by more than one distinct sequence of
bytes.
The XML_Encoding structure:
typedef struct {
  int map[256];
  void *data;
  int (*convert)(void *data, const char *s);
  void (*release)(void *data);
} XML_Encoding;
contains an array of integers indexed by the first byte of an encoding
sequence. If the value in the array for a byte is zero or positive, then
the byte is a single-byte encoding of the Unicode scalar value contained
in the array. A -1 in this array indicates a malformed byte. If the value
is -2, -3, or -4, then the byte is the beginning of a 2-, 3-, or 4-byte
sequence, respectively. Multi-byte sequences are passed to the convert
function pointed to by the XML_Encoding structure. This function should
return the Unicode scalar value for the sequence, or -1 if the sequence
is malformed.
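For a single-byte encoding, the convert member can simply be left NULL,
since every byte maps directly through the array. As a minimal sketch of
the shape such a handler takes (the encoding name "X-EXAMPLE-8BIT" and
the empty high half of the table are hypothetical placeholders, and a
non-UNICODE Expat build is assumed, so XML_Char is plain char):

#include <string.h>
#include <expat.h>

static int
unknown_encoding(void *handler_data, const XML_Char *name,
                 XML_Encoding *info)
{
    int i;

    /* A real handler would match the name case-insensitively. */
    if (strcmp(name, "X-EXAMPLE-8BIT") != 0)
        return 0;                /* not an encoding we handle */

    for (i = 0; i < 0x80; i++)
        info->map[i] = i;        /* ASCII encodes itself */
    for (i = 0x80; i < 0x100; i++)
        info->map[i] = -1;       /* a real table of scalar values goes here */

    info->data = NULL;
    info->convert = NULL;        /* single-byte: no converter needed */
    info->release = NULL;        /* nothing to clean up */
    return 1;                    /* structure filled in successfully */
}

The handler would be registered with
XML_SetUnknownEncodingHandler(parser, unknown_encoding, NULL).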
One pitfall that novice Expat users are likely to fall into is that
although Expat may accept input in various encodings, the strings that it
passes to the handlers are always encoded in UTF-8 or UTF-16 (depending
on how Expat was compiled). Your application is responsible for any
translation of these strings into other encodings.
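A character data handler like the sketch below, for instance, always
receives UTF-8 (or UTF-16 in an XML_UNICODE build), whatever encoding
the document itself was written in:

#include <stdio.h>
#include <expat.h>

/* s is not NUL-terminated; len counts XML_Char units. In the default
   build XML_Char is char and the bytes are UTF-8, regardless of the
   input document's declared encoding. */
static void
character_data(void *user_data, const XML_Char *s, int len)
{
    fwrite(s, sizeof(XML_Char), (size_t)len, stdout);
}

It would be registered with
XML_SetCharacterDataHandler(parser, character_data).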
For example, for EUC-JP, I think I would have to populate the map with
the ASCII character set, put -2 (and -3, for the three-byte sequences
introduced by SS3, 0x8F) in the appropriate parts of the 0x80 to 0xFF
range, and, assuming the platform defines __STDC_ISO_10646__, wrap
libiconv's euc_jp_mbtowc function in a convert callback.
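Something along these lines is what I imagine, though only as a sketch:
euc_jp_mbtowc is internal to libiconv rather than part of its public
API, so this version goes through the public iconv() interface instead,
and it assumes GNU libiconv's "UCS-4-INTERNAL" encoding name for UCS-4
in host byte order (some iconv prototypes also declare the input buffer
as const char **, which would need a small adjustment):

#include <string.h>
#include <iconv.h>
#include <expat.h>

/* Convert one EUC-JP sequence to a Unicode scalar via iconv(). The
   sequence length follows from the lead byte: SS3 (0x8F) introduces a
   three-byte sequence, everything else handled here is two bytes. */
static int
euc_jp_convert(void *data, const char *s)
{
    iconv_t cd = (iconv_t)data;
    size_t inleft = ((unsigned char)s[0] == 0x8F) ? 3 : 2;
    unsigned int ucs4 = 0;
    char *inp = (char *)s;
    char *outp = (char *)&ucs4;
    size_t outleft = sizeof ucs4;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        return -1;               /* malformed sequence */
    if (ucs4 > 0xFFFF)
        return -1;               /* Expat restriction 3: BMP only */
    return (int)ucs4;
}

static void
euc_jp_release(void *data)
{
    iconv_close((iconv_t)data);
}

static int
euc_jp_unknown_encoding(void *handler_data, const XML_Char *name,
                        XML_Encoding *info)
{
    iconv_t cd;
    int i;

    if (strcmp(name, "EUC-JP") != 0)
        return 0;

    cd = iconv_open("UCS-4-INTERNAL", "EUC-JP");
    if (cd == (iconv_t)-1)
        return 0;

    for (i = 0; i < 0x80; i++)
        info->map[i] = i;        /* ASCII encodes itself */
    for (i = 0x80; i < 0x100; i++)
        info->map[i] = -1;       /* malformed unless marked below */
    info->map[0x8E] = -2;        /* SS2: half-width katakana, two bytes */
    info->map[0x8F] = -3;        /* SS3: JIS X 0212, three bytes */
    for (i = 0xA1; i <= 0xFE; i++)
        info->map[i] = -2;       /* JIS X 0208 lead bytes */

    info->data = (void *)cd;
    info->convert = euc_jp_convert;
    info->release = euc_jp_release;
    return 1;
}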
My question is: can I create such a handler, built against the libiconv
sources, that does not require semantic information about each encoding?
Is there a way to determine how many bytes will be needed to represent
each character in a character set? Can I dynamically generate this
information with Markus Kuhn's perl tools, or by some other means?
Otherwise I definitely do not have the required knowledge to implement
this and be certain it is correct for all encodings.
Any ideas?
Thanks,
Mike
--
A program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes the potential for it to be applied to tasks that are
conceptually similar and, more important, to tasks that have not
yet been conceived.
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/