[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709218#comment-14709218 ]
Jason Harrop commented on XALANJ-2419: -------------------------------------- If you just want to use TransformerIdentityImpl with astral characters, it is possible to repackage little more than org.apache.xml.serializer (and apply the fix in this issue). You also need to modify the *.properties files in that package, to point to your new repackaging. In addition to that package, I repackaged TransformerIdentityImpl, SerializerSwitcher and OutputProperties. Of course, life would be simpler if the Xalan maintainers would just make a 2.7.3 fixing this 7 year old bug. > Astral characters written as a pair of NCRs with the surrogate scalar values > when using UTF-8 > --------------------------------------------------------------------------------------------- > > Key: XALANJ-2419 > URL: https://issues.apache.org/jira/browse/XALANJ-2419 > Project: XalanJ2 > Issue Type: Bug > Components: Serialization > Affects Versions: 2.7.1 > Reporter: Henri Sivonen > Attachments: XALANJ-2419-fix.txt, XALANJ-2419-tests.txt > > > org.apache.xml.serializer.ToStream contains the following code: > else if (m_encodingInfo.isInEncoding(ch)) { > // If the character is in the encoding, and > // not in the normal ASCII range, we also > // just leave it get added on to the clean characters > > } > else { > // This is a fallback plan, we should never get here > // but if the character wasn't previously handled > // (i.e. isn't in the encoding, etc.) then what > // should we do? We choose to write out an entity > writeOutCleanChars(chars, i, lastDirtyCharProcessed); > writer.write("&#"); > writer.write(Integer.toString(ch)); > writer.write(';'); > lastDirtyCharProcessed = i; > } > This leads to the wrong (latter) if branch running for surrogates, because > isInEncoding() for UTF-8 returns false for surrogates. It is always wrong > (regardless of encoding) to escape a surrogate as an NCR. > The practical effect of this bug is that any document with astral characters > in it ends up in an ill-formed serialization and does not parse back using an > XML parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@xalan.apache.org For additional commands, e-mail: dev-h...@xalan.apache.org