[ 
https://issues.apache.org/jira/browse/LANG-977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902425#comment-13902425
 ] 

Chris Karcher commented on LANG-977:
------------------------------------

Thanks for the quick merge!  Do you know when the 3.3 release will be cut?

> NumericEntityEscaper incorrectly encodes supplementary characters
> -----------------------------------------------------------------
>
>                 Key: LANG-977
>                 URL: https://issues.apache.org/jira/browse/LANG-977
>             Project: Commons Lang
>          Issue Type: Bug
>          Components: lang.text.translate.*
>    Affects Versions: 3.2.1
>            Reporter: Chris Karcher
>            Assignee: Benedikt Ritter
>             Fix For: 3.3
>
>         Attachments: NumericEntityEscaper.patch
>
>
> NumericEntityEscaper will incorrectly encode supplementary Unicode characters 
> depending on the char length of the first code point in the string.
> To reproduce, run:
> {code}
> String escaped = NumericEntityEscaper.between(0x7f, 
> Integer.MAX_VALUE).translate("a \uD83D\uDC14 \uD83D\uDCA9");
> {code}
> Expected:
> {code}
> escaped == "a &#128020; &#128169;"
> {code}
> Actual:
> {code}
> escaped == "a &#128020;&#56340; &#128169;&#56489;"
> {code}
> The issue lies in CharSequenceTranslator.translate() and the way it checks 
> code points to figure out how many characters it needs to consume.  
> Specifically, the issue is on [line 
> 95|https://github.com/apache/commons-lang/blob/trunk/src/main/java/org/apache/commons/lang3/text/translate/CharSequenceTranslator.java#L95]:
> {code}
> // contract with translators is that they have to understand codepoints 
> // and they just took care of a surrogate pair
> for (int pt = 0; pt < consumed; pt++) {
>     pos += Character.charCount(Character.codePointAt(input, pt));
> }
> {code}
> The point of this code is to check the charCount of the character that was 
> just translated and move ahead by that many characters in the input string.  
> The bug is that it's indexing into the string using 'pt', which is _always_ 0 
> at the beginning of the loop.  It's effectively checking the charCount of 
> the first character in the string every time.
> A patch is attached that fixes the issue and includes supporting unit tests.  
> Fixing this issue in CharSequenceTranslator uncovered an issue in 
> CsvEscaper/CsvUnescaper caused by the fact that it wasn't respecting the 
> "code point contract" described in CharSequenceTranslator.translate.  The fix 
> there was to have the translate methods return the string's code point count 
> rather than character count.
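
The analysis quoted above can be reproduced without Commons Lang. What follows is a minimal sketch, not the attached patch: the class SurrogateAdvanceSketch and its methods escapeAbove and consumedCodePoints are hypothetical stand-ins, and indexing with 'pos' instead of 'pt' is an assumption about what a correction could look like. Only java.lang.Character calls are used.

```java
// Hypothetical stand-in for the translate loop discussed in LANG-977.
final class SurrogateAdvanceSketch {

    // Escapes every code point above 'threshold' as a decimal numeric entity.
    // fixed == false indexes with 'pt' (the behaviour described in the report);
    // fixed == true indexes with 'pos' (assumed correction).
    static String escapeAbove(CharSequence input, int threshold, boolean fixed) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < input.length()) {
            int cp = Character.codePointAt(input, pos);
            if (cp <= threshold) {
                // Not translated: copy the code point through unchanged.
                out.appendCodePoint(cp);
                pos += Character.charCount(cp);
                continue;
            }
            out.append("&#").append(cp).append(';');
            int consumed = 1; // the translator handled one code point
            for (int pt = 0; pt < consumed; pt++) {
                int index = fixed ? pos : pt;
                pos += Character.charCount(Character.codePointAt(input, index));
            }
        }
        return out.toString();
    }

    // Code point count, which is what the "code point contract" expects a
    // translator to report, as opposed to input.length() (a char count).
    static int consumedCodePoints(CharSequence input) {
        return Character.codePointCount(input, 0, input.length());
    }

    public static void main(String[] args) {
        String in = "a \uD83D\uDC14 \uD83D\uDCA9";
        // prints: a &#128020;&#56340; &#128169;&#56489;
        System.out.println(escapeAbove(in, 0x7f, false));
        // prints: a &#128020; &#128169;
        System.out.println(escapeAbove(in, 0x7f, true));
        // prints: 2 1
        System.out.println("\uD83D\uDC14".length() + " "
                + consumedCodePoints("\uD83D\uDC14"));
    }
}
```

With the 'pt' index the loop measures the leading 'a' (one char) each time, so the position lands on the low surrogate, which is then escaped as its own entity. With the 'pos' index it measures the code point that was just translated.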



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
