Radu Coravu created XERCESJ-1574:
------------------------------------
Summary: Problem with detected encoding for UTF-16 encoded as Unicode Little
Key: XERCESJ-1574
URL: https://issues.apache.org/jira/browse/XERCESJ-1574
Project: Xerces2-J
Issue Type: Bug
Components: DOM (Level 3 Core)
Affects Versions: 2.11.0
Reporter: Radu Coravu
I have the following test case:
    ByteArrayInputStream bis = new ByteArrayInputStream(
        "<?xml version=\"1.0\" encoding=\"UTF-16\"?><a/>".getBytes("UnicodeLittle"));
    InputSource is = new InputSource(bis);
    DOMParser dp = new DOMParser();
    dp.parse(is);
    assertEquals("UTF-16LE", dp.getDocument().getInputEncoding());
The input stream is encoded as "UnicodeLittle", so
"dp.getDocument().getInputEncoding()" should return "UTF-16LE" (at least it did
so in the previous Xerces version). Right now it returns "UTF-16" regardless of
the byte order mark in the input stream.
So a developer relying on the value returned by
"dp.getDocument().getInputEncoding()" does not know how to save the document
in order to preserve the same BOM.
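To illustrate what information is lost, here is a minimal sketch of how the byte order could be read directly from the first two bytes of the stream. This is not Xerces code; the class and method names (BomSniffer, utf16VariantFromBom) are hypothetical and exist only for this example.

```java
public class BomSniffer {

    /**
     * Hypothetical helper: returns "UTF-16LE" or "UTF-16BE" when the data
     * starts with the corresponding UTF-16 byte order mark, or null when
     * no UTF-16 BOM is present.
     */
    static String utf16VariantFromBom(byte[] bytes) {
        if (bytes == null || bytes.length < 2) {
            return null;
        }
        int b0 = bytes[0] & 0xFF;
        int b1 = bytes[1] & 0xFF;
        if (b0 == 0xFF && b1 == 0xFE) {
            return "UTF-16LE"; // BOM U+FEFF stored low byte first
        }
        if (b0 == 0xFE && b1 == 0xFF) {
            return "UTF-16BE"; // BOM U+FEFF stored high byte first
        }
        return null;
    }
}
```

An application could apply such a check to the raw bytes before handing them to the parser, but that only works when it still has access to the original stream.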
This problem is related to the encoding detection changes which were made in
the XMLEntityManager.
As a proposed modification, in the method:
org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(String,
XMLInputSource, boolean, boolean)
before the code:
fCurrentEntity = new ScannedEntity(name,....
we could add the following code:
    if ("UTF-16".equals(encoding)) {
        if (isBigEndian != null) {
            if (isBigEndian) {
                encoding = "UTF-16BE";
            } else {
                encoding = "UTF-16LE";
            }
        }
    }
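The same mapping can be sketched as a standalone method so that it can be tried outside of XMLEntityManager. This assumes isBigEndian is a nullable Boolean (as the null check above suggests), with null meaning the byte order was not detected. The class and method names (Utf16Refiner, refineEncoding) are hypothetical and not part of Xerces.

```java
public class Utf16Refiner {

    /**
     * Hypothetical helper mirroring the proposed check: narrows a generic
     * "UTF-16" label to "UTF-16BE" or "UTF-16LE" when the byte order is
     * known, and returns every other input unchanged.
     */
    static String refineEncoding(String encoding, Boolean isBigEndian) {
        if ("UTF-16".equals(encoding) && isBigEndian != null) {
            return isBigEndian.booleanValue() ? "UTF-16BE" : "UTF-16LE";
        }
        return encoding;
    }
}
```

With this mapping, refineEncoding("UTF-16", Boolean.FALSE) yields "UTF-16LE", which is the value the test case above expects from getInputEncoding().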