/Inma Marín López/:

 I have an xml document which includes special characters, for example,

<Document>
            <one>melón</one>
            <two>1º</two>
</Document>

And I want to get it in canonical form, so I do the following (using Apache XML Security and Xerces 2.7.1):

org.apache.xml.security.c14n.Canonicalizer c14n = org.apache.xml.security.c14n.Canonicalizer.getInstance(
org.apache.xml.security.transforms.Transforms.TRANSFORM_C14N_EXCL_WITH_COMMENTS);
byte [] canonicalized = c14n.canonicalize(xmldocument.getBytes());

However, I obtain the following exception:

org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.

I guess your document should include an XML Declaration [1]:

<?xml version="1.0" encoding="ISO-8859-1"?>

Because of the rules [2] for detecting the character encoding of a document, a document without an XML Declaration (and without a byte order mark) is assumed to be UTF-8.
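
To illustrate, here is a minimal sketch using the JAXP parser bundled with the JDK rather than your exact Xerces 2.7.1 setup. The class and method names are made up for the example. It parses the same ISO-8859-1 bytes with and without the XML Declaration:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of any library.
class DeclaredEncodingDemo {
    // Encodes the XML text as ISO-8859-1 bytes, parses those bytes and
    // returns the text content of the root element.
    static String parseLatin1(String xml) throws Exception {
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String body = "<one>mel\u00F3n</one>"; // \u00F3 is the accented o
        String decl = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>";

        // With the declaration the parser decodes the bytes as ISO-8859-1.
        System.out.println("with declaration: " + parseLatin1(decl + body));

        // Without it the parser assumes UTF-8 and rejects the lone 0xF3 byte.
        try {
            parseLatin1(body);
        } catch (SAXException | IOException e) {
            System.out.println("without declaration: " + e.getMessage());
        }
    }
}
```

Depending on the parser version the failure may surface as a |SAXParseException| or as an |IOException| subclass, which is why the sketch catches both.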

Alternatively, you should supply a UTF-8 byte sequence to the |Canonicalizer.canonicalize(byte[])| method. If |xmldocument| is a |String|:

Canonicalizer c14n;
...
byte[] canonicalized = c14n.canonicalize(xmldocument.getBytes("UTF-8"));

The |String.getBytes()| (no-args) method returns bytes encoding the text using the platform's default encoding, which is not necessarily "ISO-8859-1".
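
A small sketch of the difference, assuming Java 7 or later for |StandardCharsets| (on older JVMs pass the charset name as a string instead). The class and the |hex| helper are made up for the example:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class GetBytesDemo {
    // Formats bytes as space-separated hex so the encodings can be compared.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "mel\u00F3n"; // \u00F3 is the accented o

        // The no-args variant depends on the JVM's default charset.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("getBytes():      " + hex(text.getBytes()));

        // Explicit charsets give the same bytes on every platform.
        System.out.println("ISO-8859-1:      " + hex(text.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println("UTF-8:           " + hex(text.getBytes(StandardCharsets.UTF_8)));
    }
}
```

In ISO-8859-1 the accented letter is the single byte F3, while in UTF-8 it is the two bytes C3 B3. The first form is what the parser trips over when it expects UTF-8.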

The xml document is ISO-8859-1 encoded, because I want to keep special characters. If I encode it in UTF-8, the document turns into the following:

<Document>
            <one>mel?n</one>
            <two>1?</two>
</Document>

How do you encode the document in UTF-8? You're obviously doing something wrong as Unicode contains the full ISO-8859-1 repertoire for sure. Are you just decoding the "ISO-8859-1" encoded document using "UTF-8" where invalid UTF-8 byte sequences get substituted with '?' (question mark)?
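
For comparison, a lossless conversion decodes the bytes with the charset they were written in and then encodes the resulting text with the target charset. The sketch below assumes the input really is ISO-8859-1. The class and method names are made up for the example:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class TranscodeDemo {
    // Re-encodes ISO-8859-1 bytes as UTF-8 by decoding them to text first.
    static byte[] latin1ToUtf8(byte[] latin1) {
        String text = new String(latin1, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = "mel\u00F3n".getBytes(StandardCharsets.ISO_8859_1);

        // Correct: decode as ISO-8859-1, encode as UTF-8, nothing is lost.
        byte[] utf8 = latin1ToUtf8(latin1);
        System.out.println("round trip ok: " + new String(utf8, StandardCharsets.UTF_8).equals("mel\u00F3n"));

        // Lossy: encoding to a charset that lacks the character substitutes '?'.
        byte[] ascii = "mel\u00F3n".getBytes(StandardCharsets.US_ASCII);
        System.out.println("US-ASCII: " + new String(ascii, StandardCharsets.US_ASCII));
    }
}
```

In Java, a wrong decode usually produces the replacement character U+FFFD, which many consoles display as a question mark. A literal '?' typically comes from encoding into a charset that cannot represent the character, as in the second half of the sketch.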

Could you be so kind as to tell me how to parse an ISO-8859-1 encoded document with Xerces, please?

It seems you're trying one thing but asking about another. The points I've mentioned above still apply. If you don't want to, or can't, add an XML Declaration to the document, you could feed the parser an already decoded character stream instead of a byte stream, like:

InputStream byteStream;
...
Reader charStream = new InputStreamReader(byteStream, "ISO-8859-1");
InputSource source;
DocumentBuilder parser;   // it could be SAXParser as well
...
source.setCharacterStream(charStream);
parser.parse(source);
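
A self-contained version of the fragment above might look like the following sketch. It uses the JAXP parser bundled with the JDK and an in-memory byte array standing in for the real input. The class and method names are made up for the example:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class ReaderParseDemo {
    // Decodes the byte stream with the given charset and hands the parser
    // characters, so the parser does no encoding detection of its own.
    static Document parseAs(InputStream byteStream, String charsetName) throws Exception {
        Reader charStream = new InputStreamReader(byteStream, charsetName);
        InputSource source = new InputSource(charStream);
        DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return parser.parse(source);
    }

    // Returns the text of the first element with the given tag name.
    static String textOf(Document doc, String tag) {
        return doc.getElementsByTagName(tag).item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // No XML Declaration; \u00F3 is the accented o, \u00BA the ordinal sign.
        String xml = "<Document><one>mel\u00F3n</one><two>1\u00BA</two></Document>";
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);

        Document doc = parseAs(new ByteArrayInputStream(latin1), "ISO-8859-1");
        System.out.println("one = " + textOf(doc, "one"));
        System.out.println("two = " + textOf(doc, "two"));
    }
}
```

When an |InputSource| carries a character stream, the parser reads it directly and disregards any encoding declaration in the document, so the charset passed to |InputStreamReader| has to be the right one.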


[1] http://www.w3.org/TR/REC-xml/#NT-XMLDecl
[2] http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info

--
Stanimir
