The popular Expat XML parser does not have built-in support for handling
character sets other than UTF-8, UTF-16, ISO-8859-1, and ASCII. However,
it does provide a mechanism through which many other character sets may
be supported: a handler can be registered that gets called when an
unknown encoding is encountered, and this handler must then populate a
structure containing a map and a converter function. From the Expat
documentation:
Expat places restrictions on character encodings that it can support by
filling in the XML_Encoding structure.
1. Every ASCII character that can appear in a well-formed XML document
must be represented by a single byte, and that byte must correspond to
its ASCII encoding (except for the characters $@\^`{}~)
2. Characters must be encoded in 4 bytes or less.
3. All characters encoded must have Unicode scalar values less than or
equal to 65535 (0xFFFF). This does not apply to the built-in support for
UTF-16 and UTF-8.
4. No character may be encoded by more than one distinct sequence of
bytes.
The XML_Encoding structure:
typedef struct {
  int map[256];
  void *data;
  int (*convert)(void *data, const char *s);
  void (*release)(void *data);
} XML_Encoding;
contains an array of integers indexed by the first byte of an encoding
sequence. If the value in the array for a byte is zero or positive, then
the byte is a single-byte encoding of the Unicode scalar value contained
in the array. A -1 in this array indicates a malformed byte. If the value
is -2, -3, or -4, then the byte is the beginning of a 2-, 3-, or 4-byte
sequence, respectively. Multi-byte sequences are passed to the convert
function pointed to by the XML_Encoding structure. This function should
return the Unicode scalar value for the sequence, or -1 if the sequence
is malformed.
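For a single-byte encoding, the convert member can simply be left NULL,
since every byte maps directly through the array. As a minimal sketch of
the shape such a handler takes (the encoding name "X-EXAMPLE-8BIT" and
the empty high half of the table are hypothetical placeholders, and a
non-UNICODE Expat build is assumed, so XML_Char is plain char):

#include <string.h>
#include <expat.h>

static int
unknown_encoding(void *handler_data, const XML_Char *name,
                 XML_Encoding *info)
{
    int i;

    /* A real handler would match the name case-insensitively. */
    if (strcmp(name, "X-EXAMPLE-8BIT") != 0)
        return 0;                /* not an encoding we handle */

    for (i = 0; i < 0x80; i++)
        info->map[i] = i;        /* ASCII encodes itself */
    for (i = 0x80; i < 0x100; i++)
        info->map[i] = -1;       /* a real table of scalar values goes here */

    info->data = NULL;
    info->convert = NULL;        /* single-byte: no converter needed */
    info->release = NULL;        /* nothing to clean up */
    return 1;                    /* structure filled in successfully */
}

The handler would be registered with
XML_SetUnknownEncodingHandler(parser, unknown_encoding, NULL).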
One pitfall that novice Expat users are likely to fall into is that
although Expat may accept input in various encodings, the strings that it
passes to the handlers are always encoded in UTF-8 or UTF-16 (depending
on how Expat was compiled). Your application is responsible for any
translation of these strings into other encodings.
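A character data handler like the sketch below, for instance, always
receives UTF-8 (or UTF-16 in an XML_UNICODE build), whatever encoding
the document itself was written in:

#include <stdio.h>
#include <expat.h>

/* s is not NUL-terminated; len counts XML_Char units. In the default
   build XML_Char is char and the bytes are UTF-8, regardless of the
   input document's declared encoding. */
static void
character_data(void *user_data, const XML_Char *s, int len)
{
    fwrite(s, sizeof(XML_Char), (size_t)len, stdout);
}

It would be registered with
XML_SetCharacterDataHandler(parser, character_data).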
For example, for EUC-JP, I think I would have to populate the map with
the ASCII character set, put -2 (and -3, for the three-byte sequences
introduced by SS3, 0x8F) in the appropriate parts of the 0x80 to 0xFF
range, and, assuming the platform defines __STDC_ISO_10646__, wrap
libiconv's euc_jp_mbtowc function in a convert callback.
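Something along these lines is what I imagine, though only as a sketch:
euc_jp_mbtowc is internal to libiconv rather than part of its public
API, so this version goes through the public iconv() interface instead,
and it assumes GNU libiconv's "UCS-4-INTERNAL" encoding name for UCS-4
in host byte order (some iconv prototypes also declare the input buffer
as const char **, which would need a small adjustment):

#include <string.h>
#include <iconv.h>
#include <expat.h>

/* Convert one EUC-JP sequence to a Unicode scalar via iconv(). The
   sequence length follows from the lead byte: SS3 (0x8F) introduces a
   three-byte sequence, everything else handled here is two bytes. */
static int
euc_jp_convert(void *data, const char *s)
{
    iconv_t cd = (iconv_t)data;
    size_t inleft = ((unsigned char)s[0] == 0x8F) ? 3 : 2;
    unsigned int ucs4 = 0;
    char *inp = (char *)s;
    char *outp = (char *)&ucs4;
    size_t outleft = sizeof ucs4;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        return -1;               /* malformed sequence */
    if (ucs4 > 0xFFFF)
        return -1;               /* Expat restriction 3: BMP only */
    return (int)ucs4;
}

static void
euc_jp_release(void *data)
{
    iconv_close((iconv_t)data);
}

static int
euc_jp_unknown_encoding(void *handler_data, const XML_Char *name,
                        XML_Encoding *info)
{
    iconv_t cd;
    int i;

    if (strcmp(name, "EUC-JP") != 0)
        return 0;

    cd = iconv_open("UCS-4-INTERNAL", "EUC-JP");
    if (cd == (iconv_t)-1)
        return 0;

    for (i = 0; i < 0x80; i++)
        info->map[i] = i;        /* ASCII encodes itself */
    for (i = 0x80; i < 0x100; i++)
        info->map[i] = -1;       /* malformed unless marked below */
    info->map[0x8E] = -2;        /* SS2: half-width katakana, two bytes */
    info->map[0x8F] = -3;        /* SS3: JIS X 0212, three bytes */
    for (i = 0xA1; i <= 0xFE; i++)
        info->map[i] = -2;       /* JIS X 0208 lead bytes */

    info->data = (void *)cd;
    info->convert = euc_jp_convert;
    info->release = euc_jp_release;
    return 1;
}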
My question is: can I create such a handler, built against the libiconv
sources, that does not require semantic information about each encoding?
Is there a way to determine how many bytes will be needed to represent
each character in a character set? Can I dynamically generate this
information with Markus Kuhn's perl tools, or by some other means?
Otherwise I definitely do not have the required knowledge to implement
this and be certain it is correct for all encodings.
Any ideas?
Thanks,
Mike
--
A program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes the potential for it to be applied to tasks that are
conceptually similar and, more important, to tasks that have not
yet been conceived.
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/