[jira] Commented: (LANG-617) StringEscapeUtils.escapeXML() can't process UTF-16 supplementary characters

Henri Yandell (JIRA) Wed, 14 Apr 2010 00:49:22 -0700

    [ 
https://issues.apache.org/jira/browse/LANG-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856784#action_12856784
 ]


Henri Yandell commented on LANG-617:
------------------------------------

In 3.0 the code has changed a fair amount. escapeXML and escapeHTML no longer
escape these characters by default, but it's easy to turn the feature back on.
The code is also now code point based, but I need to get a good unit test in
for supplementary characters before deciding what the code should actually do:

http://svn.apache.org/repos/asf/commons/proper/lang/trunk/src/main/java/org/apache/commons/lang3/text/translate/UnicodeEscaper.java
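
As a rough illustration of what "code point based" means here, the sketch
below walks a string one code point at a time and writes each code point above
0x7F as a single decimal character reference. It is not the UnicodeEscaper
code linked above; the class name CodePointEscapeSketch and the helper
escapeNonAscii are hypothetical, and named entities such as &amp; are
deliberately left out.

```java
// Illustrative sketch only; this is not the Commons Lang implementation.
final class CodePointEscapeSketch {

    // Hypothetical helper: writes each code point above 0x7F as a decimal
    // numeric character reference and copies ASCII through unchanged.
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            if (codePoint > 0x7F) {
                out.append("&#").append(codePoint).append(';');
            } else {
                out.append((char) codePoint);
            }
            // Advance by 1 char for BMP code points, 2 for supplementary ones.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) is stored as the surrogate pair D834 DD1E.
        String clef = "\uD834\uDD1E";
        System.out.println(escapeNonAscii("a" + clef + "b")); // prints a&#119070;b
    }
}
```

A unit test for supplementary characters could pin down this kind of
behaviour: one reference per code point, never one per surrogate half.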

> StringEscapeUtils.escapeXML() can't process UTF-16 supplementary characters
> ---------------------------------------------------------------------------
>
>                 Key: LANG-617
>                 URL: https://issues.apache.org/jira/browse/LANG-617
>             Project: Commons Lang
>          Issue Type: Bug
>          Components: lang.*
>    Affects Versions: 2.4
>            Reporter: David Garcia
>            Priority: Minor
>
> Supplementary characters in UTF-16 are those whose code points are above 
> 0xffff, that is, they require more than one Java char to be encoded, as 
> explained here: http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
> Currently, StringEscapeUtils.escapeXML() isn't aware of this encoding scheme 
> and treats each char as one character, which is not always right.
> A possible solution in class Entities would be:
>     public void escape(Writer writer, String str) throws IOException {
>         int len = str.length();
>         for (int i = 0; i < len; i++) {
>             // Read a full code point, which may span two Java chars.
>             int code = str.codePointAt(i);
>             String entityName = this.entityName(code);
>             if (entityName != null) {
>                 writer.write('&');
>                 writer.write(entityName);
>                 writer.write(';');
>             } else if (code > 0x7F) {
>                 writer.write("&#");
>                 // Writer.write(int) emits a single char, so write the decimal digits.
>                 writer.write(Integer.toString(code, 10));
>                 writer.write(';');
>             } else {
>                 writer.write((char) code);
>             }
>             // Skip the low surrogate of a supplementary character.
>             if (code > 0xffff) {
>                 i++;
>             }
>         }
>     }
> Besides fixing escapeXML(), this will also affect HTML escaping functions. I 
> guess that's a good thing, but please remember I have only tested escapeXML().
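
To make the reported problem concrete, here is a small sketch of what happens
when a supplementary character is processed one Java char at a time. The class
SurrogatePairDemo and its escapePerChar helper are hypothetical stand-ins for a
char-based escaper, not the actual Entities code from 2.4.

```java
// Illustrative sketch only; the names below are hypothetical.
final class SurrogatePairDemo {

    // Escapes every char above 0x7F on its own, ignoring surrogate pairs.
    static String escapePerChar(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c > 0x7F) {
                out.append("&#").append((int) c).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1D11E is one character but occupies two Java chars.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());                         // prints 2
        System.out.println(clef.codePointCount(0, clef.length())); // prints 1
        // Each surrogate half is escaped separately.
        System.out.println(escapePerChar(clef)); // prints &#55348;&#56606;
    }
}
```

The two references in the last line name surrogate code points, which are not
legal XML characters, whereas the single character should be written as
&#119070;.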

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
