[ https://issues.apache.org/jira/browse/XALANJ-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612680#comment-16612680 ]
Peter De Maeyer commented on XALANJ-2617:
-----------------------------------------

It can be proven with a unit test that Daniel's fix breaks some scenarios that used to work. As I suspected, the "if" has to be an "else if". I've attached my own new patch + unit tests. Note that the patch spans 2 repositories: the fix is relative to [http://svn.apache.org/repos/asf/xalan/java/trunk], the unit test is relative to [http://svn.apache.org/repos/asf/xalan/test/trunk].

Just in case the patch isn't readable, this is the essence of the test code:
{code:java}
/**
 * This test case illustrates the original problem with high-surrogate characters.
 * This is broken in Xalan 2.7.2, hence the need for a fix.
 */
public void serializationOfHighSurrogateCharactersInUtf8() throws Throwable {
    reporter.testCaseInit("serializationOfHighSurrogateCharactersInUtf8");
    try {
        String value = "\uD840\uDC0B";
        serializationOf(value, "&#" + toCodePoint(value.charAt(0), value.charAt(1)) + ";", "UTF-8");
    } finally {
        reporter.testCaseClose();
    }
}

/**
 * This is a sanity test case illustrating some US-ASCII characters and some low-surrogate non-ASCII characters.
 * It works in Xalan 2.7.2 and with any of the patches, it's just a basic sanity check.
 */
public void serializationOfLowSurrogateCharactersInUtf8() throws Throwable {
    reporter.testCaseInit("serializationOfLowSurrogateCharactersInUtf8");
    try {
        serializationOf("This is gonna cost ya some €€€", "This is gonna cost ya some €€€", "UTF-8");
    } finally {
        reporter.testCaseClose();
    }
}

/**
 * This test case illustrates a use case which works in Xalan 2.7.2 but which got <i>broken</i> by Daniel's patch.
 */
public void serializationOfLineSeparatorInAscii() throws Throwable {
    reporter.testCaseInit("serializationOfLineSeparatorInAscii");
    try {
        // U+2028 (decimal 8232) cannot be encoded in US-ASCII, so a numeric character reference is expected.
        serializationOf(String.valueOf((char) 0x2028), "&#8232;", "US-ASCII");
    } finally {
        reporter.testCaseClose();
    }
}

private void serializationOf(String value, String expectedXmlValue, String encoding)
        throws ParserConfigurationException, TransformerException, IOException, SAXException {
    System.out.println("Expected value: " + value);
    String expected = "<?xml version=\"1.0\" encoding=\"" + encoding + "\"?><a>" + expectedXmlValue + "</a>";
    System.out.println("  Expected XML: " + expected);

    // Build a one-element DOM whose text content is the value under test.
    StringWriter writer = new StringWriter();
    final DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document dom = documentBuilder.newDocument();
    final Element rootEl = dom.createElement("a");
    rootEl.setTextContent(value);
    dom.appendChild(rootEl);

    // Serialize it with the requested output encoding.
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(javax.xml.transform.OutputKeys.ENCODING, encoding);
    transformer.transform(new DOMSource(dom), new javax.xml.transform.stream.StreamResult(writer));
    String actual = writer.toString();
    System.out.println("    Actual XML: " + actual);

    // Parse the output again to show that it is well-formed and round-trips.
    InputSource inputSource = new InputSource();
    inputSource.setCharacterStream(new StringReader(actual));
    System.out.println("  Actual value: " + documentBuilder.parse(inputSource).getDocumentElement().getTextContent());

    // Character.LINE_SEPARATOR is a byte constant (a Unicode general category), not a newline string.
    String newline = System.getProperty("line.separator");
    reporter.check(actual, expected, actual + newline + " must be equal to " + newline + expected);
}

/**
 * This is a duplicate of {@link org.apache.xml.serializer.Encodings#toCodePoint(char, char)}.
 * We can't use that method because it's package-private.
 * We can't use {@link String#codePointAt(int)} either because it's @Since Java 1.5 and this codebase needs to be Java 1.3 compliant.
 */
static int toCodePoint(char highSurrogate, char lowSurrogate) {
    int codePoint = ((highSurrogate - 0xd800) << 10) + (lowSurrogate - 0xdc00) + 0x10000;
    return codePoint;
}
{code}

> Serializer produces separately escaped surrogate pair instead of codepoint
> --------------------------------------------------------------------------
>
>                 Key: XALANJ-2617
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2617
>             Project: XalanJ2
>          Issue Type: Bug
>      Security Level: No security risk; visible to anyone (Ordinary problems in Xalan projects. Anybody can view the issue.)
>          Components: Serialization, Xalan
>   Affects Versions: 2.7.1, 2.7.2
>            Reporter: Daniel Kec
>            Assignee: Steven J. Hathaway
>            Priority: Major
>         Attachments: JI9053942.java, XALANJ-2617_Fix_missing_surrogate_pairs_support.patch, XALANJ-2617_Fix_missing_surrogate_pairs_support_new.patch
>
>
> When trying to serialize XML with a character consisting of a Unicode surrogate pair, "\uD840\uDC0B", I have tried several and none worked. The XML Transformer creates an XML string in which each half of the surrogate pair is escaped separately, which makes the XML unparseable, e.g.: SAXParseException; Character reference "&#55360" is an invalid XML character. It looks like a bug introduced in the XALANJ-2271 fix.
>
> {code:java|title=Output of Xalan ver. 2.7.2}
> kec@phoebe:~/Downloads$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> kec@phoebe:~/Downloads$ java -cp
> /home/kec/.m2/repository/xml-apis/xml-apis/1.4.01/xml-apis-1.4.01.jar:/home/kec/.m2/repository/xalan/xalan/2.7.2/xalan-2.7.2.jar:/home/kec/.m2/repository/xalan/serializer/2.7.2/serializer-2.7.2.jar:.
> JI9053942
> Character: 𠀋
> EXPECTED: <?xml version="1.0" encoding="UTF-8"?><a>&#131083;</a>
> ACTUAL: <?xml version="1.0" encoding="UTF-8"?><a>&#55360;&#56331;</a>
> [Fatal Error] :1:50: Character reference "&#
> {code}
> {code:java|title=But Xalan ver. 2.7.0 works OK}
> kec@phoebe:~/Downloads$ java -cp
> /home/kec/.m2/repository/xml-apis/xml-apis/1.4.01/xml-apis-1.4.01.jar:/home/kec/.m2/repository/xalan/xalan/2.7.0/xalan-2.7.0.jar:/home/kec/.m2/repository/xalan/serializer/2.7.0/serializer-2.7.0.jar:.
> JI9053942
> Character: 𠀋
> EXPECTED: <?xml version="1.0" encoding="UTF-8"?><a>&#131083;</a>
> ACTUAL: <?xml version="1.0" encoding="UTF-8"?><a>𠀋</a>
> ACTUAL PARSED CHAR 𠀋
> {code}
> {code:java|title=Test}
> String value = "\uD840\uDC0B";
> System.out.println("Character: " + value);
> System.out.println("EXPECTED: <?xml version=\"1.0\" encoding=\"UTF-8\"?><a>&#" + value.codePointAt(0) + ";</a>");
> StringWriter writer = new StringWriter();
> final DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
> Document dom = documentBuilder.newDocument();
> final Element rootEl = dom.createElement("a");
> rootEl.setTextContent(value);
> dom.appendChild(rootEl);
> Transformer transformer = TransformerFactory.newInstance().newTransformer();
> transformer.transform(new DOMSource(dom), new javax.xml.transform.stream.StreamResult(writer));
> String xml = writer.toString();
> System.out.println(" ACTUAL: " + xml);
> InputSource inputSource = new InputSource();
> inputSource.setCharacterStream(new StringReader(xml));
> System.out.println("ACTUAL PARSED CHAR " + documentBuilder.parse(inputSource).getDocumentElement().getTextContent());
> {code}
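
For readers who want to see the arithmetic behind the two behaviours discussed in this issue, here is a minimal, self-contained sketch. It is not the Xalan serializer code and not either patch: the class `SurrogateEscapeSketch`, the method `escape` and its `asciiOnly` flag are hypothetical names made up for illustration. The sketch assumes that a surrogate pair should become one numeric character reference for the combined code point, and that in an ASCII-only encoding any other non-ASCII character (such as U+2028) gets its own reference. Chaining the branches with `else if` is one way to make sure each character is handled by exactly one branch.

```java
// Hypothetical illustration only; this is NOT the Xalan implementation.
final class SurrogateEscapeSketch {

    /** Combines a UTF-16 surrogate pair into one code point (same arithmetic as the toCodePoint helper in the test code). */
    static int toCodePoint(char highSurrogate, char lowSurrogate) {
        return ((highSurrogate - 0xD800) << 10) + (lowSurrogate - 0xDC00) + 0x10000;
    }

    /**
     * Escapes text for XML character data under the assumptions stated above.
     * The asciiOnly flag is an invented stand-in for "the output encoding is US-ASCII".
     */
    static String escape(String text, boolean asciiOnly) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean startsPair = c >= 0xD800 && c <= 0xDBFF
                    && i + 1 < text.length()
                    && text.charAt(i + 1) >= 0xDC00 && text.charAt(i + 1) <= 0xDFFF;
            if (startsPair) {
                // One reference for the whole pair, never one per surrogate half.
                out.append("&#").append(toCodePoint(c, text.charAt(i + 1))).append(';');
                i++; // the low surrogate has been consumed
            } else if (asciiOnly && c > 0x7F) {
                // Character not representable in ASCII: emit a reference for it.
                out.append("&#").append((int) c).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("\uD840\uDC0B", false));               // prints &#131083;
        System.out.println(escape(String.valueOf((char) 0x2028), true)); // prints &#8232;
    }
}
```

The two printed values match the expectations used in the unit tests: 131083 is the decimal form of U+2000B, and 8232 is the decimal form of U+2028.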