[ https://issues.apache.org/jira/browse/LANG-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856840#action_12856840 ]
David Garcia commented on LANG-617:
-----------------------------------
I am attaching two files with a very simple test case:
- utf8-fragment.txt
  Contains a UTF-8-encoded string with a few BMP (Basic Multilingual Plane)
  characters and 6 non-BMP characters (the ones causing trouble).
- xml-escaped-fragment.txt
  Contains the expected XML-escaped string.
I hope you find it useful.
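
To make the BMP / non-BMP distinction concrete, here is a minimal Java sketch that counts the supplementary characters in a string. The sample string and the class and method names are made up for illustration; they are not the contents of the attached files.

```java
final class CodePointDemo {
    // Counts the code points in s that lie outside the BMP (above U+FFFF).
    static int countSupplementary(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                count++;
            }
            // Advance by 1 char for a BMP code point, 2 for a surrogate pair.
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical sample: 'a', U+00E9, U+20000 (as a surrogate pair), 'b'.
        String sample = "a\u00E9\uD840\uDC00b";
        System.out.println(sample.length());                            // 5 chars
        System.out.println(sample.codePointCount(0, sample.length()));  // 4 code points
        System.out.println(countSupplementary(sample));                 // 1
    }
}
```

The gap between `length()` and `codePointCount()` is exactly what a char-by-char escaper overlooks.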
> StringEscapeUtils.escapeXML() can't process UTF-16 supplementary characters
> ---------------------------------------------------------------------------
>
> Key: LANG-617
> URL: https://issues.apache.org/jira/browse/LANG-617
> Project: Commons Lang
> Issue Type: Bug
> Components: lang.*
> Affects Versions: 2.4
> Reporter: David Garcia
> Priority: Minor
> Attachments: utf8-fragment.txt, xml-escaped-fragment.txt
>
>
> Supplementary characters are those whose code points are above 0xffff. In
> UTF-16 each one is encoded as a surrogate pair, so it takes two Java chars
> rather than one, as explained here:
> http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
> Currently, StringEscapeUtils.escapeXml() isn't aware of this encoding scheme
> and treats each char as one character, so a supplementary character is
> escaped as two separate surrogate values instead of a single code point.
> A possible solution in class Entities would be:
>     public void escape(Writer writer, String str) throws IOException {
>         int len = str.length();
>         for (int i = 0; i < len; i++) {
>             // Read a whole code point; a surrogate pair yields one value above 0xffff.
>             int code = str.codePointAt(i);
>             String entityName = this.entityName(code);
>             if (entityName != null) {
>                 writer.write('&');
>                 writer.write(entityName);
>                 writer.write(';');
>             } else if (code > 0x7F) {
>                 writer.write("&#");
>                 // Writer.write(int) would emit a single char, not the decimal digits.
>                 writer.write(Integer.toString(code, 10));
>                 writer.write(';');
>             } else {
>                 writer.write((char) code);
>             }
>             if (code > 0xffff) {
>                 i++; // skip the low surrogate already consumed by codePointAt
>             }
>         }
>     }
> Besides fixing escapeXml(), this will also affect the HTML escaping
> functions, which go through the same Entities code. I guess that's a good
> thing, but please remember I have only tested escapeXml().
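
The proposal quoted above can be tried outside Commons Lang with a self-contained sketch along these lines. It is only an approximation under stated assumptions: the class name is invented, `entityName` is a hypothetical stand-in for the lookup in `Entities` that knows just the five predefined XML entities, and the numeric reference is written in decimal.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

final class SupplementaryEscapeSketch {
    // Hypothetical stand-in for the Entities lookup; only the five
    // predefined XML entities are covered here.
    static String entityName(int codePoint) {
        switch (codePoint) {
            case '&':  return "amp";
            case '<':  return "lt";
            case '>':  return "gt";
            case '"':  return "quot";
            case '\'': return "apos";
            default:   return null;
        }
    }

    // Escapes str code point by code point instead of char by char.
    static void escape(Writer writer, String str) throws IOException {
        int len = str.length();
        for (int i = 0; i < len; ) {
            int code = str.codePointAt(i);
            String name = entityName(code);
            if (name != null) {
                writer.write('&');
                writer.write(name);
                writer.write(';');
            } else if (code > 0x7F) {
                writer.write("&#");
                writer.write(Integer.toString(code)); // decimal digits of the code point
                writer.write(';');
            } else {
                writer.write(code); // plain ASCII, written as a single char
            }
            i += Character.charCount(code); // 2 for a surrogate pair, else 1
        }
    }

    // Convenience wrapper that returns the escaped text as a String.
    static String escape(String str) {
        StringWriter out = new StringWriter();
        try {
            escape(out, str);
        } catch (IOException e) {
            throw new IllegalStateException(e); // StringWriter does not throw
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+20000 is written as the surrogate pair \uD840\uDC00.
        System.out.println(escape("a<\uD840\uDC00>")); // a&lt;&#131072;&gt;
    }
}
```

Advancing the index by `Character.charCount(code)` is an alternative to the explicit `code > 0xffff` check; both skip the low surrogate of a pair.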