Hello, I deal with Japanese text quite a bit and was recently parsing a file
that contained the Unicode character U+2000B
(http://www.fileformat.info/info/unicode/char/2000B/index.htm) in a comment.
This character appears to have caused a SAXParseException to be thrown:

[Fatal Error] :484236:25: An invalid XML character (Unicode: 0xd840) was found in the comment.
org.xml.sax.SAXParseException; lineNumber: 484236; columnNumber: 25; An invalid XML character (Unicode: 0xd840) was found in the comment.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:254)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:300)
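
Note that the message reports 0xd840 rather than U+2000B. That value looks like the first half of the UTF-16 surrogate pair Java uses to store U+2000B. Here is a minimal sketch to check this, using only java.lang.Character; the class name SurrogateCheck and the method utf16Units are made up for illustration and are not part of Xerces:

```java
public class SurrogateCheck {

    /** Returns the UTF-16 code units of a code point, formatted as hex strings. */
    public static String[] utf16Units(int codePoint) {
        // Character.toChars yields one char for BMP code points and a
        // high/low surrogate pair for supplementary code points.
        char[] units = Character.toChars(codePoint);
        String[] hex = new String[units.length];
        for (int i = 0; i < units.length; i++) {
            hex[i] = String.format("0x%04x", (int) units[i]);
        }
        return hex;
    }

    public static void main(String[] args) {
        int cp = 0x2000B; // the character from the failing comment
        System.out.println(String.join(" ", utf16Units(cp)));           // 0xd840 0xdc0b
        System.out.println(Character.isSupplementaryCodePoint(cp));     // true
    }
}
```

If that reading is correct, the parser is complaining about one half of the pair (0xd840) on its own, not about the code point U+2000B as a whole.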

In this particular case, I was attempting to parse Jim Breen's publicly
available Kanji dictionary file, which is used extensively in many
Japanese/English open-source dictionaries. I exchanged a few emails with Jim,
and he is confident that the XML is valid. I've reviewed the "Characters"
section of the W3C XML 1.0 spec
(http://www.w3.org/TR/2004/REC-xml-20040204/#charsets) and honestly cannot
tell for certain whether U+2000B is valid in a comment.

Jim's file has an entry for each kanji, preceded by a comment that looks like
this:

    <!-- Entry for Kanji: X -->

where X is the actual character. If I remove all such comments, the file
parses fine.

If you are interested in checking out the file, it can be downloaded in GZIP
format from Jim Breen's site:

Info Page: http://www.csse.monash.edu.au/~jwb/kanjidic2/
XML File: http://www.csse.monash.edu.au/~jwb/kanjidic2/kanjidic2.xml.gz

As a side note, I was able to successfully parse this file with Apache Xerces
Perl.

Thank you for your time.

Best Regards,
Rick Noelle