[ 
https://issues.apache.org/jira/browse/LANG-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527896#comment-13527896
 ] 

Michael Houston commented on LANG-862:
--------------------------------------

Apologies, I see this is fixed in the latest SVN - should have browsed the 
source code first!
                
> CharSequenceTranslator causes StringIndexOutOfBoundsException during 
> translation of unicode codepoints with length > 1 character
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LANG-862
>                 URL: https://issues.apache.org/jira/browse/LANG-862
>             Project: Commons Lang
>          Issue Type: Bug
>          Components: lang.text.translate.*
>    Affects Versions: 3.1
>         Environment: OS X, Java 1.6
>            Reporter: Michael Houston
>              Labels: bug, text, unicode
>
> When translating a string that contains Unicode characters, I encountered an 
> index exception:
> {code}
>       java.lang.StringIndexOutOfBoundsException: String index out of range: 50
>       at java.lang.String.charAt(String.java:686)
>       at java.lang.Character.codePointAt(Character.java:2335)
>       at org.apache.commons.lang3.text.translate.CharSequenceTranslator.translate(CharSequenceTranslator.java:95)
>       at org.apache.commons.lang3.text.translate.CharSequenceTranslator.translate(CharSequenceTranslator.java:59)
>       at org.apache.commons.lang3.StringEscapeUtils.escapeCsv(StringEscapeUtils.java:556)
>       ...
> {code}
> The input string was from a twitter status:
> org.apache.commons.lang3.StringEscapeUtils.escapeCsv("pink & black adidas suit for this rainy weather \ud83d\udc4d");
> Both those characters are 'Invalid' unicode characters, so presumably there 
> is a conversion error somewhere. However, this shouldn't cause the translator 
> to crash.
> At line 94, the loop which generates the exception increments the position by 
> the size of the codepoint, which seems to grow faster than the number of 
> characters. I don't really know how codepoints work, but it looks to me like 
> there are two indexes which are treated as if they are the same one by this 
> loop:
>  * pt is incrementing by one character each iteration
>  * pos is incrementing by one or more characters each iteration
>  * pos is being used to index into the character array
>  * pt is the value actually being tested in the loop test, so pos can be 
> bigger than pt, causing an index problem at the end of the array
> My guess would be that the loop should read something like:
> {code}
>             for (int pt = 0; pt < consumed;) {
>                 int count = Character.charCount(Character.codePointAt(input, pos));
>                 pt += count;
>                 pos += count;
>             }
> {code}
> I'm not sure if that was the intention, hope it makes some sense!
> Stepping through that code with the input string " \ud83d\udc4d":
> * the input string becomes " \ud83d\udc4d\u008d" (appended 'Reverse Line 
> Feed' - no idea why)
> * consumed == 4
> * Iterating the loop gives pt=0, pos=0 -> pt=1, pos=1 -> pt=2, pos=3 -> pt=3, 
> pos=4 (Index exception)
> So \ud83d\udc4d seems to be a codepoint with a width of 2, which puts the 
> index off by one after that.
> Anyway, hope that helps,
> Regards,
> Mike.
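
The walkthrough quoted above can be reproduced outside Commons Lang with a 
small standalone sketch. This is not the library source: the class name 
SurrogateLoopSketch and both method names are hypothetical, the first loop is 
only a reconstruction of the behaviour the report describes, and the second 
is the reporter's suggestion, which assumes that consumed counts chars 
(UTF-16 code units) rather than code points. For context, \ud83d\udc4d is a 
surrogate pair that encodes the single code point U+1F44D, so 
Character.charCount returns 2 for it.

```java
class SurrogateLoopSketch {

    // Loop as described in the report (a reconstruction, not the library code):
    // pt advances by one per iteration, pos advances by the char width of the
    // code point found at pos.
    static int advanceAsReported(CharSequence input, int consumed) {
        int pos = 0;
        for (int pt = 0; pt < consumed; pt++) {
            pos += Character.charCount(Character.codePointAt(input, pos));
        }
        return pos;
    }

    // Loop as proposed in the report: pt and pos advance by the same amount.
    // Assumption: consumed is a count of chars, not of code points.
    static int advanceAsProposed(CharSequence input, int consumed) {
        int pos = 0;
        for (int pt = 0; pt < consumed;) {
            int count = Character.charCount(Character.codePointAt(input, pos));
            pt += count;
            pos += count;
        }
        return pos;
    }

    public static void main(String[] args) {
        // The four-char string from the walkthrough: space, surrogate pair, one more char.
        String input = " \ud83d\udc4d\u008d";
        int consumed = 4;

        System.out.println("chars: " + input.length()
                + ", code points: " + input.codePointCount(0, input.length()));

        try {
            advanceAsReported(input, consumed);
            System.out.println("reported loop finished without an exception");
        } catch (IndexOutOfBoundsException e) {
            // pos reaches input.length() while pt is still below consumed
            System.out.println("reported loop failed: " + e.getMessage());
        }

        System.out.println("proposed loop ends at pos = "
                + advanceAsProposed(input, consumed));
    }
}
```

With this input the first loop reads past the end of the string on its fourth 
iteration, while the second stops once pos equals the string length. Whether 
the change committed to SVN takes the same approach is not something this 
sketch claims.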

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
