[ https://issues.apache.org/jira/browse/XALANJ-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805831#comment-17805831 ]
Joseph Kesselman commented on XALANJ-2560:
-------------------------------------------

Took a quick look at the OpenJDK implementation. There are more changes there than just the fix for this issue: some code refactoring into a utilities class, some code reformatting, and the decision to use .endsWith("yes") rather than .equals("yes") when testing some parameters. I don't see anything inherently unreasonable so far, but it discourages just picking up their forked versions of To*Stream.

Their logic for handling high characters seems to make sense at first glance, though.

The comments point out that when a character is not in the target encoding, the output is unspecified. At least some of the code currently resorts to serializing it as an SGML Numeric Character Reference (not an entity reference, despite the comments), on the argument that while it's wrong it might be better than nothing; at least you can see what happened.

We *should* also be reporting a Warning in that case, precisely because it is undefined behavior; maybe even (optionally?) treating it as an Error if users want to ensure portability.

> ToXMLStream does not support unicode supplementary characters
> -------------------------------------------------------------
>
>                 Key: XALANJ-2560
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2560
>             Project: XalanJ2
>          Issue Type: Bug
>      Security Level: No security risk; visible to anyone(Ordinary problems in
> Xalan projects. Anybody can view the issue.)
>          Components: Serialization
>    Affects Versions: 2.7.1
>         Environment: Xalan 2.7.1 serializer.
> Tested on Ubuntu 12.04 with Oracle JDK 1.7.0_05.
>            Reporter: Damien Guillaume
>            Assignee: Joe Kesselman
>            Priority: Major
>              Labels: serialization, unicode
>
> org.apache.xml.serializer.ToXMLStream (which extends ToStream) does not
> support serialization of unicode supplementary characters such as U+1D49C. It
> creates invalid character references like "&#55349;&#56476;" instead of
> "&#119964;" (or F0 9D 92 9C with UTF-8).
> ToXMLStream is used by LSSerializer
> when Xalan's serializer is on the classpath.
> org.apache.xml.serialize.DOMSerializerImpl (included in Xerces) does not have
> this problem, but it is deprecated since Xerces 2.9.0, so this is a
> regression.
> See
> http://stackoverflow.com/questions/11952289/serializing-supplementary-unicode-characters-into-xml-documents-with-java
> for more details.
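
For illustration, here is a minimal sketch of the behavior discussed above. It is not the Xalan or OpenJDK code: the class SupplementaryNcrSketch and its two methods are hypothetical names, and it assumes java.nio.charset.CharsetEncoder.canEncode is an acceptable test for "not in the target encoding". It walks the string by code point, so a surrogate pair such as U+1D49C becomes one numeric character reference instead of two invalid ones. It does not escape markup characters such as & or <, and it does not emit the Warning suggested in the comment.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class SupplementaryNcrSketch {

    /**
     * Hypothetical helper: copies text, replacing each code point that the
     * target encoding cannot represent with one decimal numeric character
     * reference for the whole code point.
     */
    public static String escape(String text, String encodingName) {
        CharsetEncoder encoder = Charset.forName(encodingName).newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            // codePointAt joins a valid high/low surrogate pair into one code point.
            int cp = text.codePointAt(i);
            int units = Character.charCount(cp);
            if (units == 1 && Character.isSurrogate((char) cp)) {
                // An unpaired surrogate has no valid XML representation.
                throw new IllegalArgumentException("Unpaired surrogate at index " + i);
            }
            String chunk = text.substring(i, i + units);
            if (encoder.canEncode(chunk)) {
                out.append(chunk);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += units;
        }
        return out.toString();
    }

    /**
     * The faulty pattern described in the report, for comparison only:
     * one reference per UTF-16 code unit, which splits surrogate pairs.
     */
    public static String escapePerCodeUnit(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append("&#").append((int) c).append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String scriptA = new String(Character.toChars(0x1D49C));
        System.out.println(escape("x" + scriptA, "US-ASCII"));  // x&#119964;
        System.out.println(escapePerCodeUnit("x" + scriptA));   // x&#55349;&#56476;
    }
}
```

With a target encoding of UTF-8 the same call returns the character unchanged, since the encoder can represent it directly.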