[ 
https://issues.apache.org/jira/browse/LANG-977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Karcher updated LANG-977:
-------------------------------

    Description: 
NumericEntityEscaper incorrectly encodes supplementary Unicode characters, 
depending on the char length of the first code point in the string.

To reproduce, run:
{code}
String escaped = NumericEntityEscaper.between(0x7f, 
Integer.MAX_VALUE).translate("a \uD83D\uDC14 \uD83D\uDCA9");
{code}

Expected:
{code}
escaped == "a &#128020; &#128169;"
{code}

Actual:
{code}
escaped == "a &#128020;&#56340; &#128169;&#56489;"
{code}
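
The extra entities in the actual output are the low surrogates of each pair, escaped a second time on their own. A small sketch of the arithmetic, using only java.lang.Character (the class and method names are hypothetical, chosen for illustration):

```java
public class SurrogateDemo {
    // Combine a high and a low surrogate into the supplementary code point they encode.
    static int codePoint(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        // First pair from the reproduction string.
        System.out.println(codePoint('\uD83D', '\uDC14')); // 128020
        System.out.println((int) '\uDC14');                // 56340 (low surrogate alone)

        // Second pair from the reproduction string.
        System.out.println(codePoint('\uD83D', '\uDCA9')); // 128169
        System.out.println((int) '\uDCA9');                // 56489 (low surrogate alone)
    }
}
```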

The issue lies in CharSequenceTranslator.translate() and the way it checks code 
points to figure out how many characters it needs to consume.  Specifically, 
the issue is on [line 
95|https://github.com/apache/commons-lang/blob/trunk/src/main/java/org/apache/commons/lang3/text/translate/CharSequenceTranslator.java#L95]:

{code}
// contract with translators is that they have to understand codepoints 
// and they just took care of a surrogate pair
for (int pt = 0; pt < consumed; pt++) {
    pos += Character.charCount(Character.codePointAt(input, pt));
}
{code}

The point of this code is to check the charCount of the character that was just 
translated and move ahead by that many characters in the input string.  The bug 
is that it indexes into the string using 'pt', which is _always_ 0 at the 
beginning of the loop, rather than using the current position 'pos'.  It's 
effectively checking the charCount of the first character in the string every 
time.
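
To make the effect concrete, here is a self-contained sketch that imitates the advance loop without depending on Commons Lang. The class AdvanceLoopDemo, the stand-in translateOne method and the 'fixed' flag are all assumptions made for illustration; this is not the library's code and not the attached patch. Indexing with 'pos' instead of 'pt' is one plausible way to correct the loop.

```java
public class AdvanceLoopDemo {
    // Hypothetical stand-in for a translator: escapes code points >= 0x7f as
    // decimal numeric entities and returns the number of code points consumed.
    static int translateOne(CharSequence input, int index, StringBuilder out) {
        int cp = Character.codePointAt(input, index);
        if (cp < 0x7f) {
            return 0; // nothing translated
        }
        out.append("&#").append(cp).append(';');
        return 1; // one code point consumed
    }

    // Imitates the outer translate loop. With fixed == false the advance loop
    // indexes with 'pt' (the behaviour described above); with fixed == true it
    // indexes with 'pos'.
    static String translate(CharSequence input, boolean fixed) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        int len = input.length();
        while (pos < len) {
            int consumed = translateOne(input, pos, out);
            if (consumed == 0) {
                // Copy the untranslated code point through unchanged.
                char[] c = Character.toChars(Character.codePointAt(input, pos));
                out.append(c);
                pos += c.length;
                continue;
            }
            for (int pt = 0; pt < consumed; pt++) {
                int index = fixed ? pos : pt;
                pos += Character.charCount(Character.codePointAt(input, index));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "a \uD83D\uDC14 \uD83D\uDCA9";
        System.out.println(translate(input, false)); // a &#128020;&#56340; &#128169;&#56489;
        System.out.println(translate(input, true));  // a &#128020; &#128169;
    }
}
```

Because the input starts with 'a', the 'pt' variant advances by only one char after each surrogate pair, so the low surrogate is picked up again and escaped on its own.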

A patch is attached that fixes the issue and includes supporting unit tests.  
Fixing this issue in CharSequenceTranslator uncovered an issue in 
CsvEscaper/CsvUnescaper: they weren't respecting the "code point contract" 
described in CharSequenceTranslator.translate.  The fix there was to have the 
translate methods return the string's code point count rather than its 
character count.
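
The difference between the two counts can be seen with plain JDK calls. This is only an illustrative sketch; CodePointCountDemo and its method names are hypothetical and do not come from the patch.

```java
public class CodePointCountDemo {
    // Number of UTF-16 chars in the string.
    static int chars(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "a \uD83D\uDC14";
        System.out.println(chars(s));      // 4
        System.out.println(codePoints(s)); // 3
    }
}
```

A translator that reports the char count as "consumed" overstates how far the caller should advance whenever the input contains a surrogate pair.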

  was:
NumericEntityEscaper will incorrectly encode supplementary unicode characters 
depending on the char length of the first code point in the string.

To reproduce, run:
{code}
String escaped = NumericEntityEscaper.between(0x7f, 
Integer.MAX_VALUE).translate("a \uD83D\uDC14 \uD83D\uDCA9");
{code}

Expected:
{code}
escaped == "a &#128020; &#128169;"
{code}

Actual:
{code}
escaped == "a &#128020;&#56340; &#128169;&#56489;
{code}

The issue lies in CharSequenceTranslator.translate() and the way it checks code 
points to figure out how many characters it needs to consume.  Specifically, 
the issue is on [line 
95|https://github.com/apache/commons-lang/blob/trunk/src/main/java/org/apache/commons/lang3/text/translate/CharSequenceTranslator.java#L95]:

{code}
// contract with translators is that they have to understand codepoints 
// and they just took care of a surrogate pair
for (int pt = 0; pt < consumed; pt++) {
    pos += Character.charCount(Character.codePointAt(input, pt));
}
{code}

The point of this code is to check the charCount of the character that was just 
translated and move ahead by that many characters in the input string.  The bug 
is that it's indexing into the string using 'pt', which is _always_ 0 at the 
beginning of the loop.  It's effectively checking the charCount of the first 
character in the string every time.

A patch is attached that fixes the issue and includes supporting unit tests.  
Fixing this issue in CharSequenceTranslator uncovered an issue in 
CsvEscaper/CsvUnescaper caused by the fact that it wasn't respecting the "code 
point contract" described in CharSequenceTranslator.translate.  The fix there 
was to have the translate methods return the string's code point count rather 
than character count.


> NumericEntityEscaper incorrectly encodes supplementary characters
> -----------------------------------------------------------------
>
>                 Key: LANG-977
>                 URL: https://issues.apache.org/jira/browse/LANG-977
>             Project: Commons Lang
>          Issue Type: Bug
>          Components: lang.text.translate.*
>    Affects Versions: 3.2.1
>            Reporter: Chris Karcher
>         Attachments: NumericEntityEscaper.patch



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
