Eugene Shkel created XALANJ-2593:
------------------------------------
Summary: Incorrect showing of supplementary characters in
attributes
Key: XALANJ-2593
URL: https://issues.apache.org/jira/browse/XALANJ-2593
Project: XalanJ2
Issue Type: Bug
Security Level: No security risk; visible to anyone (Ordinary problems in
Xalan projects. Anybody can view the issue.)
Components: Serialization
Affects Versions: 2.7.2
Environment: Win 7 x64, Java 1.6
Reporter: Eugene Shkel
Assignee: Steven J. Hathaway
In Xalan 2.7.2, supplementary characters (see
http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html for
details) are serialized incorrectly in attribute values.
For example, I need to output the characters 𣎴 (&#144308;) and 𠘨 (&#132648;) in
attribute "y" of element "x".
Expected result: {code}<?xml version="1.0" encoding="UTF-8"?><x y="𣎴 - 𠘨"/>{code}
Actual result for Xalan 2.7.2 is: {code}<?xml version="1.0" encoding="UTF-8"?><x y="�� - ��"/>{code}
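For context, a supplementary character occupies two char values (a surrogate pair) in a Java String, so a serializer that looks at one char at a time never sees the full code point. The following is a minimal sketch of that representation; the class and method names are hypothetical and not part of Xalan.

```java
// Illustrative only: SurrogateDemo and charRef are hypothetical names, not Xalan API.
class SurrogateDemo {
    /** Builds a decimal character reference for the code point that starts at index i. */
    static String charRef(String s, int i) {
        return "&#" + s.codePointAt(i) + ";";
    }

    public static void main(String[] args) {
        String s = "\uD84C\uDFB4"; // U+233B4 written as its high and low surrogate
        System.out.println(s.length());                             // 2 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));        // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
        System.out.println(charRef(s, 0));                          // &#144308;
    }
}
```

The second character from the example, U+20628, behaves the same way and maps to &#132648;.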
Code snippet for test:
{code}
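import java.io.StringReader;
import javax.xml.transform.Result;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SupplementaryAttrTest { // class name is arbitrary
    public static void main(String[] argv) throws Exception {
        TransformerFactory tFactory = TransformerFactory.newInstance();
        // The stylesheet copies the text of xslt/search/value1 into attribute "y" of element "x".
        StreamSource stylesource = new StreamSource(new StringReader(
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">"
            + "<xsl:template match=\"/\"><x y=\"{xslt/search/value1}\"/></xsl:template>"
            + "</xsl:stylesheet>"));
        Transformer transformer = tFactory.newTransformer(stylesource);
        // The input contains two supplementary characters, U+233B4 and U+20628.
        StreamSource source = new StreamSource(new StringReader(
            "<?xml version=\"1.0\"?>"
            + "<xslt><search><value1>𣎴 - 𠘨</value1></search></xslt>"));
        Result result = new StreamResult(System.out);
        transformer.transform(source, result);
    }
}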
{code}
The problem relates to the method
org.apache.xml.serializer.ToStream.writeAttrString(Writer, String, String).
{code}
if (m_charInfo.shouldMapAttrChar(ch)) {
    // The character is supposed to be replaced by a String
    // e.g. '&' --> "&amp;"
    // e.g. '<' --> "&lt;"
    accumDefaultEscape(writer, ch, i, stringChars, len, false, true);
}
{code}
This branch does not handle multi-character sequences, such as the surrogate
pairs that the Java platform uses for supplementary characters, so each half of
the pair falls through to the following branch of the same method:
{code}
else {
    // This is a fallback plan, we should never get here
    // but if the character wasn't previously handled
    // (i.e. isn't in the encoding, etc.) then what
    // should we do? We choose to write out a character ref
    writer.write("!13&#");
    writer.write(Integer.toString(ch));
    writer.write(';');
}
{code}
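To illustrate the effect of this fallback, here is a simplified stand-in, not Xalan's actual code. It assumes that every non-ASCII char reaches the fallback and that a plain "&#" prefix is written; the class and method names are hypothetical. Each half of a surrogate pair gets its own character reference, and references to surrogate code points are not legal in XML.

```java
// Illustrative only: a simplified stand-in for the per-char fallback, not Xalan code.
class PerCharFallbackDemo {
    /** Escapes an attribute value one char at a time, ignoring surrogate pairs. */
    static String escapePerChar(String value) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char ch = value.charAt(i);
            if (ch < 0x80) {
                out.append(ch); // ASCII passes through unchanged in this sketch
            } else {
                // Assumption: every other char is written as a decimal character reference.
                out.append("&#").append(Integer.toString(ch)).append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+233B4 and U+20628, written as surrogate pairs
        System.out.println(escapePerChar("\uD84C\uDFB4 - \uD841\uDE28"));
        // prints: &#55372;&#57268; - &#55361;&#56872;
    }
}
```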
PS: I can't attach a patch file, so I am including the patch here.
{code}
--- src\org\apache\xml\serializer\ToStream.java 2014-03-26 17:21:30 +0200
+++ src\org\apache\xml\serializer\ToStream.java 2014-09-09 19:09:30 +0300
@@ -2112,8 +2112,13 @@
// e.g. '&' --> "&amp;"
// e.g. '<' --> "&lt;"
accumDefaultEscape(writer, ch, i, stringChars, len, false, true);
- }
- else {
+ } else if (Encodings.isHighUTF16Surrogate(ch)) {
+ // more than single input character can be processed
+ // within accumDefaultEscape()
+ // so we set appropriate value for loop for().
+ i = accumDefaultEscape(writer, ch, i, stringChars, len, false, true);
+
+ } else {
if (0x0 <= ch && ch <= 0x1F) {
// Range 0x00 through 0x1F inclusive
// This covers the non-whitespace control characters
{code}
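The idea behind the patch can be sketched outside Xalan as follows. This is an illustrative loop under stated assumptions, not the patched ToStream code: the class and method names are hypothetical, and the sketch always writes a numeric character reference, whereas the expected result above contains the raw characters for UTF-8 output. When a high surrogate is followed by a low surrogate, both are consumed together and the loop index is advanced past the pair, which is the role of the assignment to i in the patch.

```java
// Illustrative only: PairAwareEscapeDemo is a hypothetical name, not Xalan API.
class PairAwareEscapeDemo {
    /** Escapes an attribute value, treating a surrogate pair as one code point. */
    static String escape(String value) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char ch = value.charAt(i);
            if (Character.isHighSurrogate(ch)
                    && i + 1 < value.length()
                    && Character.isLowSurrogate(value.charAt(i + 1))) {
                // Combine both halves into one code point and emit a single reference.
                int codePoint = Character.toCodePoint(ch, value.charAt(i + 1));
                out.append("&#").append(codePoint).append(';');
                i++; // skip the low surrogate that was just consumed
            } else if (ch < 0x80) {
                out.append(ch); // ASCII passes through unchanged in this sketch
            } else {
                out.append("&#").append((int) ch).append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("\uD84C\uDFB4 - \uD841\uDE28"));
        // prints: &#144308; - &#132648;
    }
}
```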