Thu, 31 Mar 2022 23:05:47 -0700, /Thangalin/:
Back in 2013, a question was asked about how to preserve entities (e.g.,
Unicode and emojis) when transforming:
"My XSLT transformations have been successful for months until I ran
across an XML file with Unicode characters (emoji characters). I need to
preserve the Unicode but XSLT is converting it to HTML Entities. I
thought that setting the encoding to UTF-8 would solve my problem but
I'm still having issues."
The answer was to look at the 'xalan:entities' serializer:
http://xml.apache.org/xalan-j/usagepatterns.html#outputprops
I've switched from Xalan to Saxon to handle the conversion flawlessly,
using a single line of code:
    System.setProperty(
        "javax.xml.transform.TransformerFactory",
        "net.sf.saxon.TransformerFactoryImpl" );
The downside is adding 6MB to encode emojis, which Xalan is already
doing, just not quite as needed (&#55357;&#56397; is generated instead
of &#128077;, for example).
Is there an example showing how to use the xalan:entities serializer to
preserve entities?
Let's clarify: &#55357;, &#56397; and &#128077; are character (Unicode
code point) references and not (named) entity references. For setting up
your own xalan:entities I guess you could have a look at the source:
* http://svn.apache.org/viewvc/xalan/java/trunk/src/org/apache/xml/serializer/XMLEntities.properties?view=markup
* http://svn.apache.org/viewvc/xalan/java/trunk/src/org/apache/xml/serializer/HTMLEntities.properties?view=markup
You may notice these provide a mapping between a character (code point)
and the entity name to substitute in the result. However, your problem
appears to be that Xalan doesn't support non-BMP code points, i.e. those
past the Basic Multilingual Plane (> hex FFFF, decimal 65535). The Java
char type can't represent every Unicode code point; it is just a UTF-16
code unit. Thus a non-BMP character is encoded as two char values, a
surrogate pair. Java 5 introduced APIs for decoding these to a Unicode
code point, for example:
* https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#codePointAt-int-
but Xalan doesn't seem to support non-BMP characters currently/still:
* https://issues.apache.org/jira/browse/XALANJ-2595
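To make the surrogate-pair point concrete, here is a minimal sketch using only the standard String and Character APIs. The class and method names are made up for illustration, and perCharReferences merely mimics the kind of output seen from the official Xalan; it is not Xalan's actual serializer code.

```java
public class SurrogateDemo {
    // Naive approach: one numeric reference per UTF-16 unit (char).
    // A non-BMP character therefore yields two references.
    public static String perCharReferences(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            out.append("&#").append((int) text.charAt(i)).append(';');
        }
        return out.toString();
    }

    // Code-point-aware approach: one numeric reference per Unicode
    // code point, advancing by one or two chars as needed.
    public static String perCodePointReferences(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            out.append("&#").append(codePoint).append(';');
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F44D (thumbs up) written as an explicit surrogate pair.
        String thumbsUp = "\uD83D\uDC4D";
        System.out.println(thumbsUp.length());                // 2
        System.out.println(perCharReferences(thumbsUp));      // &#55357;&#56397;
        System.out.println(perCodePointReferences(thumbsUp)); // &#128077;
    }
}
```

The two values 55357 and 56397 are simply the high and low surrogates (hex D83D and DC4D) of the single code point 128077 (hex 1F44D).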
FWIW, the following example works as expected with the forked Xalan
version included in the Oracle/OpenJDK:
import java.io.StringReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TransformTest {
    public static void main(String[] args) throws Exception {
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer transformer = tf.newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(
                OutputKeys.OMIT_XML_DECLARATION, "yes");
        String xmlSource = "<foo>👍</foo>";
        transformer.transform(
                new StreamSource(new StringReader(xmlSource)),
                new StreamResult(System.out));
    }
}
I'm getting a result of:
    <foo>&#128077;</foo>
Plugging in the official Xalan, I'm getting:
    <foo>&#55357;&#56397;</foo>
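If staying on the official Xalan matters more than the extra dependency, one possible stopgap might be to post-process the serialized text and merge each pair of surrogate references into a single code point reference. The following is only a sketch under that assumption: the class and method names are hypothetical, it uses nothing from Xalan itself, and it only handles five-digit decimal references (no hex, no leading zeros) without regard to CDATA sections or comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SurrogateReferenceMerger {
    // Two adjacent five-digit decimal references, e.g. "&#55357;&#56397;".
    private static final Pattern PAIR =
            Pattern.compile("&#(\\d{5});&#(\\d{5});");

    public static String merge(String xml) {
        StringBuilder out = new StringBuilder();
        Matcher m = PAIR.matcher(xml);
        int copiedUpTo = 0;
        int searchFrom = 0;
        while (m.find(searchFrom)) {
            int high = Integer.parseInt(m.group(1));
            int low = Integer.parseInt(m.group(2));
            boolean isPair = high >= 0xD800 && high <= 0xDBFF
                    && low >= 0xDC00 && low <= 0xDFFF;
            if (isPair) {
                // Combine high and low surrogate into one code point.
                int codePoint = Character.toCodePoint((char) high, (char) low);
                out.append(xml, copiedUpTo, m.start())
                   .append("&#").append(codePoint).append(';');
                copiedUpTo = m.end();
                searchFrom = m.end();
            } else {
                // Not a surrogate pair: retry one character later so an
                // overlapping pair is not skipped.
                searchFrom = m.start() + 1;
            }
        }
        return out.append(xml, copiedUpTo, xml.length()).toString();
    }

    public static void main(String[] args) {
        // prints <foo>&#128077;</foo>
        System.out.println(merge("<foo>&#55357;&#56397;</foo>"));
    }
}
```

This treats the symptom rather than the cause, so switching the TransformerFactory is likely the cleaner route when the extra dependency is acceptable.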
--
Stanimir