/Inma Marín López/:
I have an xml document which includes special characters, for example,
<Document>
<one>melón</one>
<two>1º</two>
</Document>
And I want to get it in canonical form, so I do the following (using
Apache XML Security and Xerces 2.7.1):
org.apache.xml.security.c14n.Canonicalizer c14n =
org.apache.xml.security.c14n.Canonicalizer.getInstance(
org.apache.xml.security.transforms.Transforms.TRANSFORM_C14N_EXCL_WITH_COMMENTS);
byte [] canonicalized =
c14n.canonicalize(xmldocument.getBytes());
However, I obtain the following exception:
org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.
I guess your document should include an XML Declaration [1]:
<?xml version="1.0" encoding="ISO-8859-1"?>
Under the rules [2] for detecting the character encoding of a
document, a document with no XML Declaration (and no byte order mark)
is assumed to be UTF-8.
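A rough sketch of this first option, not your actual code: the class
and method names are made up for illustration, and it uses the JAXP
DocumentBuilder that ships with the JDK instead of the Canonicalizer.
With the declaration present, the parser can decode the ISO-8859-1
bytes on its own:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DeclaredEncodingDemo {

    // Hypothetical sample: the document from the question, with an XML
    // Declaration added, encoded as ISO-8859-1 bytes.
    // \u00f3 is 'ó' and \u00ba is 'º'; escapes keep this source ASCII-only.
    static byte[] latin1Document() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<Document><one>mel\u00f3n</one><two>1\u00ba</two></Document>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Parses raw bytes; the parser picks the encoding from the declaration.
    static String textOfOne(byte[] xmlBytes) {
        try {
            DocumentBuilder parser =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = parser.parse(new ByteArrayInputStream(xmlBytes));
            return doc.getElementsByTagName("one").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("mel\u00f3n".equals(textOfOne(latin1Document())));
    }
}
```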
Alternatively you should supply a UTF-8 byte sequence to the
|Canonicalizer.canonicalize(byte[])| method. If |xmldocument| is a
|String|:
Canonicalizer c14n;
...
byte[] canonicalized = c14n.canonicalize(xmldocument.getBytes("UTF-8"));
The |String.getBytes()| (no-args) method returns bytes encoding the
text using the platform's default encoding, not necessarily "ISO-8859-1".
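To illustrate the difference, here is a small sketch (the class and
method names are only for illustration, and it assumes Java 7 or later
for StandardCharsets). The same text produces different bytes depending
on the charset you name, and the no-args variant depends on the machine
it runs on:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingBytesDemo {

    // "mel\u00f3n" is "melón"; the escape keeps this source ASCII-only.
    static final String TEXT = "mel\u00f3n";

    // 'ó' is the single byte 0xF3 in ISO-8859-1.
    static int latin1Length() {
        return TEXT.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // 'ó' is the two bytes 0xC3 0xB3 in UTF-8.
    static int utf8Length() {
        return TEXT.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // What the no-args getBytes() would use on this machine.
        System.out.println("platform default: " + Charset.defaultCharset());
        System.out.println("ISO-8859-1 bytes: " + latin1Length()); // 5
        System.out.println("UTF-8 bytes:      " + utf8Length());   // 6
    }
}
```

A lone 0xF3 byte looks like the start of a 4-byte UTF-8 sequence, which
would be consistent with the exception message you got.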
The xml document is ISO-8859-1 encoded, because I want to keep special
characters. If I encode it in UTF-8, the document turns into the following:
<Document>
<one>mel?n</one>
<two>1?</two>
</Document>
How do you encode the document in UTF-8? You're obviously doing
something wrong, since Unicode contains the full ISO-8859-1
repertoire. Are you perhaps just decoding the "ISO-8859-1" encoded
document as "UTF-8", so that the invalid UTF-8 byte sequences get
substituted with '?' (question mark)?
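This is only my guess at what happens on your side, but the following
sketch (hypothetical class and method names, Java 7 or later assumed)
reproduces that kind of damage: ISO-8859-1 bytes decoded as UTF-8 give
the replacement character U+FFFD, and writing that back out in
ISO-8859-1 turns it into '?':

```java
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {

    // Encodes as ISO-8859-1, then wrongly decodes the bytes as UTF-8.
    // The invalid byte is replaced with U+FFFD by the decoder.
    static String misdecode(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    // Writes the damaged text back out as ISO-8859-1 and reads it again.
    // U+FFFD has no ISO-8859-1 mapping, so the encoder emits '?'.
    static String roundTrip(String text) {
        byte[] out = misdecode(text).getBytes(StandardCharsets.ISO_8859_1);
        return new String(out, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00f3 is 'ó'
        String original = "<one>mel\u00f3n</one>";
        System.out.println(misdecode(original).indexOf('\uFFFD') >= 0);
        System.out.println(roundTrip(original));
    }
}
```

Encoding the original String with getBytes("UTF-8") instead loses
nothing, since every ISO-8859-1 character has a UTF-8 representation.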
Could you be so kind as to tell me how to parse an ISO-8859-1 encoded
document with Xerces, please?
It seems you're trying one thing but asking about another. The
points I've mentioned above still apply. If you don't want to, or
can't, add an XML Declaration to the document, you could feed the
parser an already decoded character stream instead of a byte stream, like:
InputStream byteStream;
...
Reader charStream = new InputStreamReader(byteStream, "ISO-8859-1");
InputSource source;
DocumentBuilder parser; // it could be SAXParser as well
...
source.setCharacterStream(charStream);
parser.parse(source);
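Filled in as a self-contained sketch (the class name, method names and
the in-memory sample document are my own assumptions, and it uses the
JAXP parser bundled with the JDK), that could look like:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ReaderParseDemo {

    // Decodes the bytes as ISO-8859-1 ourselves, so the parser never has
    // to guess the encoding and no XML Declaration is needed.
    static Document parseLatin1(InputStream byteStream) {
        try {
            Reader charStream =
                    new InputStreamReader(byteStream, StandardCharsets.ISO_8859_1);
            InputSource source = new InputSource(charStream);
            DocumentBuilder parser =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return parser.parse(source);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    // Hypothetical sample: the document from the question, without a
    // declaration, as ISO-8859-1 bytes (\u00f3 is 'ó', \u00ba is 'º').
    static String textOfTwo() {
        byte[] bytes = "<Document><one>mel\u00f3n</one><two>1\u00ba</two></Document>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parseLatin1(new ByteArrayInputStream(bytes));
        return doc.getElementsByTagName("two").item(0).getTextContent();
    }

    public static void main(String[] args) {
        System.out.println("1\u00ba".equals(textOfTwo()));
    }
}
```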
[1] http://www.w3.org/TR/REC-xml/#NT-XMLDecl
[2] http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info
--
Stanimir