Re: XML Entities

Stanimir Stamenkov Sat, 02 Apr 2022 13:57:29 -0700

Thu, 31 Mar 2022 23:05:47 -0700, /Thangalin/:

Back in 2013, a question was asked about how to preserve entities (e.g., Unicode and emojis) when transforming:
"My XSLT transformations have been successful for months until I ran across an XML file with Unicode characters (emoji characters). I need to preserve the Unicode but XSLT is converting it to HTML Entities. I thought that setting the encoding to UTF-8 would solve my problem but I'm still having issues."
The answer was to look at the 'xalan:entities' serializer:

http://xml.apache.org/xalan-j/usagepatterns.html#outputprops
I've switched from Xalan to Saxon to handle the conversion flawlessly, using a single line of code:
       System.setProperty(
         "javax.xml.transform.TransformerFactory",
         "net.sf.saxon.TransformerFactoryImpl" );
The downside is adding 6MB to encode emojis, which Xalan is already doing, just not quite as needed (&#55357;&#56397; is generated instead of 👍, for example).
Is there an example showing how to use the xalan:entities serializer to preserve entities?

Let's clarify: &#55357; &#56397; 👍 are character (Unicode code point) references and not (named) entity references. For setting up your own xalan:entities, I guess you could have a look at the source:

*   http://svn.apache.org/viewvc/xalan/java/trunk/src/org/apache/xml/serializer/XMLEntities.properties?view=markup
*   http://svn.apache.org/viewvc/xalan/java/trunk/src/org/apache/xml/serializer/HTMLEntities.properties?view=markup

You may notice these provide a mapping between a character (code point) and the entity name to substitute in the result. However, your problem appears to be that Xalan doesn't support non-BMP code points (those past the Basic Multilingual Plane, i.e. > Hex: FFFF, Dec: 65535). The Java char type can't represent every Unicode code point; it is just a UTF-16 code unit. Thus a non-BMP character is encoded as two char values, a surrogate pair. Java 5 introduced APIs for decoding these to a Unicode code point, for example:

*   https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#codePointAt-int-

but Xalan doesn't seem to support non-BMP characters currently/still:

*   https://issues.apache.org/jira/browse/XALANJ-2595
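
To make the surrogate-pair point concrete, here is a minimal sketch using only java.lang APIs. The class name CodePointDemo and the helper escapeNonAscii are hypothetical names made up for this illustration; this is not how Xalan's serializer is implemented, only what iterating by code point rather than by char looks like:

```java
public class CodePointDemo {

    // Hypothetical helper: writes every non-ASCII code point as a decimal
    // character reference, walking the string by code point, not by char.
    // It does not escape markup characters such as '<' or '&'.
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);           // decodes a surrogate pair if present
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp);        // 1 for BMP, 2 for non-BMP
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String thumbsUp = "\uD83D\uDC4D";        // U+1F44D as two UTF-16 units
        System.out.println(thumbsUp.length());           // 2
        System.out.println((int) thumbsUp.charAt(0));    // 55357 (high surrogate)
        System.out.println((int) thumbsUp.charAt(1));    // 56397 (low surrogate)
        System.out.println(thumbsUp.codePointAt(0));     // 128077
        System.out.println(escapeNonAscii("<foo>" + thumbsUp + "</foo>"));
        // <foo>&#128077;</foo>
    }
}
```

A serializer that emits one reference per char instead produces the two surrogate values, which is the &#55357;&#56397; form.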

FWIW, the following example works as expected with the forked Xalan version included in the Oracle/OpenJDK:


import java.io.StringReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TransformTest {

    public static void main(String[] args) throws Exception {
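        TransformerFactory tf = TransformerFactory.newInstance();
        // No stylesheet: an identity transformer that copies source to result
        Transformer transformer = tf.newTransformer();
        // US-ASCII output forces non-ASCII characters to be written
        // as numeric character references
        transformer.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(
                OutputKeys.OMIT_XML_DECLARATION, "yes");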

        String xmlSource = "<foo>&#x1F44D;</foo>";
        transformer.transform(
                new StreamSource(new StringReader(xmlSource)),
                new StreamResult(System.out));
    }

}

I'm getting a result of:

    <foo>&#128077;</foo>

Plugging in the official Xalan, I'm getting:

    <foo>&#55357;&#56397;</foo>
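
If switching the implementation is not an option, one possible workaround is to post-process the serialized text and merge each pair of decimal surrogate references into a single code point reference. The following is only a sketch under assumptions: SurrogateRefFixer and mergeSurrogateRefs are hypothetical names, it works on the output as plain text, it only handles decimal references, and it does not account for CDATA sections or comments:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SurrogateRefFixer {

    // Two adjacent 5-digit decimal references (all surrogates are 55296..57343)
    private static final Pattern PAIR =
            Pattern.compile("&#(\\d{5});&#(\\d{5});");

    // Hypothetical helper: replaces "&#hi;&#lo;" with one "&#codepoint;"
    // when hi/lo form a valid UTF-16 surrogate pair.
    static String mergeSurrogateRefs(String xml) {
        StringBuilder out = new StringBuilder();
        Matcher m = PAIR.matcher(xml);
        int pos = 0;
        while (m.find(pos)) {
            int hi = Integer.parseInt(m.group(1));
            int lo = Integer.parseInt(m.group(2));
            boolean isPair = hi >= 0xD800 && hi <= 0xDBFF
                    && lo >= 0xDC00 && lo <= 0xDFFF;
            if (isPair) {
                int cp = Character.toCodePoint((char) hi, (char) lo);
                out.append(xml, pos, m.start())
                   .append("&#").append(cp).append(';');
                pos = m.end();
            } else {
                // Not a pair: keep the '&' and retry from the next character,
                // so a later overlapping pair is still found.
                out.append(xml, pos, m.start() + 1);
                pos = m.start() + 1;
            }
        }
        out.append(xml, pos, xml.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mergeSurrogateRefs("<foo>&#55357;&#56397;</foo>"));
        // <foo>&#128077;</foo>
    }
}
```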

--
Stanimir
