Re: Error when parsing ISO-8859-1 encoded documents

Stanimir Stamenkov Fri, 28 Jul 2006 07:46:52 -0700

/Inma Marín López/:

 I have an xml document which includes special characters, for example,
<Document>
            <one>melón</one>
            <two>1º</two>
</Document>
And I want to get it in canonical form, so I do the following (using Apache XML Security and Xerces 2.7.1):
org.apache.xml.security.c14n.Canonicalizer c14n = org.apache.xml.security.c14n.Canonicalizer.getInstance(
        org.apache.xml.security.transforms.Transforms.TRANSFORM_C14N_EXCL_WITH_COMMENTS);
byte[] canonicalized = c14n.canonicalize(xmldocument.getBytes());
However, I obtain the following exception:

org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.


I guess your document should include an XML Declaration [1]:

<?xml version="1.0" encoding="ISO-8859-1" ?>

Because of the rules [2] for detecting the character encoding of a document, omitting the XML Declaration makes the parser default to UTF-8.
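To see the effect of the declaration in isolation, here is a minimal sketch using the JAXP parser bundled with the JDK (an assumption: it is not necessarily Xerces 2.7.1, and the class and method names are made up for illustration). It feeds the same ISO-8859-1 bytes to the parser with and without an XML Declaration. The non-ASCII character is written as a \u escape so the result does not depend on the encoding of the source file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

public class XmlDeclDemo {

    // Hypothetical helper: true if the default JAXP parser accepts the bytes.
    static boolean parses(byte[] xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return true;
        } catch (Exception e) {
            // SAXException or IOException: the byte stream could not be decoded/parsed.
            return false;
        }
    }

    public static void main(String[] args) {
        String body = "<Document><one>mel\u00F3n</one></Document>";
        String decl = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>";

        // Same ISO-8859-1 bytes, once without and once with the declaration.
        byte[] withoutDecl = body.getBytes(StandardCharsets.ISO_8859_1);
        byte[] withDecl = (decl + body).getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("without declaration: " + parses(withoutDecl));
        System.out.println("with declaration:    " + parses(withDecl));
    }
}
```

Without the declaration the parser assumes UTF-8, reads the byte 0xF3 (the ISO-8859-1 code for 'ó') as the start of a 4-byte UTF-8 sequence, and rejects the 'n' that follows. That matches the "Invalid byte 2 of 4-byte UTF-8 sequence" message reported above.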


Canonicalizer c14n;
...
byte[] canonicalized = c14n.canonicalize(xmldocument.getBytes("UTF-8"));

The |String.getBytes()| (no-args) method returns bytes encoding the text using the platform's default encoding, which is not necessarily "ISO-8859-1".
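A small sketch of the difference, assuming Java 7 or later for java.nio.charset.StandardCharsets (the class name and the hex helper are only for illustration). It prints the bytes produced for "melón" by the no-args call and by two explicit encodings:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GetBytesDemo {

    // Hypothetical helper: formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "mel\u00F3n"; // "melón", escaped to stay source-encoding neutral

        // Result depends on the machine and JVM settings.
        System.out.println("default (" + Charset.defaultCharset() + "): "
                + hex(text.getBytes()));

        // Explicit encodings give the same bytes everywhere.
        System.out.println("ISO-8859-1: " + hex(text.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println("UTF-8:      " + hex(text.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The ISO-8859-1 line shows 'ó' as the single byte F3, while the UTF-8 line shows it as the two bytes C3 B3. The first line varies with the platform, which is why passing the encoding explicitly is the safer habit.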

The xml document is ISO-8859-1 encoded, because I want to keep special characters. If I encode it in UTF-8, the document turns into the following:
<Document>
            <one>mel?n</one>
            <two>1?</two>
</Document>

How do you encode the document in UTF-8? You're obviously doing something wrong, as Unicode certainly contains the full ISO-8859-1 repertoire. Are you just decoding the "ISO-8859-1" encoded document using "UTF-8", where invalid UTF-8 byte sequences get substituted with '?' (question mark)?
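The following sketch (hypothetical class and method names, Java 7+ assumed) contrasts a correct UTF-8 round trip with two kinds of mismatch. Note that in Java, decoding malformed UTF-8 with the String constructor substitutes U+FFFD (the replacement character, which many consoles display as '?'), whereas a literal '?' is what you get when encoding to a charset that cannot represent the character, such as US-ASCII.

```java
import java.nio.charset.StandardCharsets;

public class MismatchDemo {

    // Encode and decode with the same charset: nothing is lost.
    static String roundTripThroughUtf8(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Mismatch: ISO-8859-1 bytes interpreted as UTF-8.
    static String decodeLatin1BytesAsUtf8(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    // Lossy encoding: characters outside US-ASCII become '?'.
    static String roundTripThroughAscii(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String text = "mel\u00F3n"; // "melón"

        System.out.println(roundTripThroughUtf8(text).equals(text));       // intact
        System.out.println(decodeLatin1BytesAsUtf8(text).contains("\uFFFD")); // damaged
        System.out.println(roundTripThroughAscii(text));                    // mel?n
    }
}
```

If the output shown above as "mel?n" came from a step like either of the last two, the problem lies in that conversion step and not in UTF-8 itself.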

Could you be so kind as to tell me how to parse an ISO-8859-1 encoded document with xerces, please????

It seems you're trying one thing but asking about another. The things I've mentioned above still apply. If you don't want to or can't add an XML Declaration to the document, you could feed the parser an already decoded character stream instead of a byte stream, like:


InputStream byteStream;
...
Reader charStream = new InputStreamReader(byteStream, "ISO-8859-1");
InputSource source;
DocumentBuilder parser;   // it could be SAXParser as well
...
source.setCharacterStream(charStream);
parser.parse(source);
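For completeness, a self-contained version of the fragment above, as a sketch under some assumptions: it uses the JAXP parser bundled with the JDK, Java 7+ for StandardCharsets, an in-memory byte array standing in for the real input, and illustrative class and method names.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ParseLatin1Demo {

    // Hypothetical helper: parses ISO-8859-1 bytes that carry no XML Declaration
    // and returns the text of the first <one> element.
    static String textOfOne(byte[] latin1Xml) {
        try {
            InputStream byteStream = new ByteArrayInputStream(latin1Xml);
            // Decode the bytes ourselves, so the parser never has to guess the encoding.
            Reader charStream = new InputStreamReader(byteStream, StandardCharsets.ISO_8859_1);

            DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = parser.parse(new InputSource(charStream));
            return doc.getElementsByTagName("one").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Document><one>mel\u00F3n</one><two>1\u00BA</two></Document>";
        byte[] latin1Xml = xml.getBytes(StandardCharsets.ISO_8859_1);

        String text = textOfOne(latin1Xml);
        System.out.println(text.equals("mel\u00F3n")); // true if 'ó' survived intact
    }
}
```

Because the parser receives characters, not bytes, the missing XML Declaration no longer matters for decoding.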


[1] http://www.w3.org/TR/REC-xml/#NT-XMLDecl
[2] http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info

--
Stanimir
