Thu, 31 Mar 2022 23:05:47 -0700, /Thangalin/:

Back in 2013, a question was asked about how to preserve entities (e.g., Unicode characters and emojis) when transforming:

"My XSLT transformations have been successful for months until I ran across an XML file with Unicode characters (emoji characters). I need to preserve the Unicode but XSLT is converting it to HTML Entities. I thought that setting the encoding to UTF-8 would solve my problem but I'm still having issues."

The answer was to look at the 'xalan:entities' serializer:

http://xml.apache.org/xalan-j/usagepatterns.html#outputprops

I've switched from Xalan to Saxon, which handles the conversion flawlessly, using a single statement:

       System.setProperty(
         "javax.xml.transform.TransformerFactory",
         "net.sf.saxon.TransformerFactoryImpl" );

The downside is adding 6 MB to encode emojis, which Xalan is already doing, just not quite as needed (&#55357;&#56397; is generated instead of &#128077;, for example).

Is there an example showing how to use the xalan:entities serializer to preserve entities?

Let's clarify: &#55357; &#56397; &#128077; are character (Unicode code point) references, not (named) entity references. For setting up your own xalan:entities, I guess you could have a look at the source:

* http://svn.apache.org/viewvc/xalan/java/trunk/src/org/apache/xml/serializer/XMLEntities.properties?view=markup
* http://svn.apache.org/viewvc/xalan/java/trunk/src/org/apache/xml/serializer/HTMLEntities.properties?view=markup

You may notice these provide a mapping between a character (code point) and the entity name to substitute in the result. However, your problem appears to be that Xalan doesn't support non-BMP code points (past the Basic Multilingual Plane), that is, code points above hex FFFF (decimal 65535). The Java char type can't represent every Unicode code point; it is just a UTF-16 code unit. Thus a non-BMP character is encoded as two char values, a surrogate pair. Java 5 introduced APIs for decoding these to a Unicode code point, for example:

* https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#codePointAt-int-

but Xalan doesn't seem to support non-BMP characters currently/still:

*   https://issues.apache.org/jira/browse/XALANJ-2595
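To make the surrogate-pair encoding described above concrete, here is a small sketch using only standard java.lang APIs. The class name SurrogateDemo and its two helper methods are made up for illustration. It splits U+1F44D into its two UTF-16 code units and joins them back into a single code point:

```java
import java.util.Arrays;

final class SurrogateDemo {

    // Hypothetical helper: the UTF-16 code units of a string as numbers.
    static int[] utf16Units(String s) {
        int[] units = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            units[i] = s.charAt(i);
        }
        return units;
    }

    // Hypothetical helper: joins a high/low surrogate pair into one code point.
    static int join(int high, int low) {
        return Character.toCodePoint((char) high, (char) low);
    }

    public static void main(String[] args) {
        // Build the string from the code point to avoid source-encoding issues.
        String thumbsUp = new String(Character.toChars(0x1F44D));

        System.out.println(thumbsUp.length());                       // 2
        System.out.println(thumbsUp.codePointAt(0));                 // 128077
        System.out.println(Arrays.toString(utf16Units(thumbsUp)));   // [55357, 56397]
        System.out.println(join(55357, 56397));                      // 128077
    }
}
```

A serializer that writes each char as its own character reference would therefore produce 55357 and 56397 rather than the single code point 128077.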

FWIW, the following example works as expected with the forked Xalan version included in the Oracle/OpenJDK:

import java.io.StringReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TransformTest {

    public static void main(String[] args) throws Exception {
        // No stylesheet is given, so this is an identity transformer.
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer transformer = tf.newTransformer();
        // US-ASCII output forces non-ASCII characters to be written
        // as numeric character references.
        transformer.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(
                OutputKeys.OMIT_XML_DECLARATION, "yes");

        // U+1F44D lies outside the BMP.
        String xmlSource = "<foo>&#x1F44D;</foo>";
        transformer.transform(
                new StreamSource(new StringReader(xmlSource)),
                new StreamResult(System.out));
    }

}

I'm getting a result of:

    <foo>&#128077;</foo>

Plugging in the official Xalan, I'm getting:

    <foo>&#55357;&#56397;</foo>
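
When comparing the two results, it may help to confirm which implementation JAXP actually picked up. The following is only a minimal sketch (the class name FactoryCheck and its method are hypothetical). The printed class name depends on the classpath and on the javax.xml.transform.TransformerFactory system property, so no output is shown:

```java
import javax.xml.transform.TransformerFactory;

final class FactoryCheck {

    // Hypothetical helper: the class name of the factory JAXP resolves to.
    static String factoryClassName() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println(factoryClassName());
    }
}
```

A name under com.sun.org.apache.xalan.internal would suggest the JDK's bundled fork, and a name under org.apache.xalan the separately distributed Xalan.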

--
Stanimir
