[jira] [Commented] (LANG-1406) StringIndexOutOfBoundsException in StringUtils.replaceIgnoreCase

ASF GitHub Bot (JIRA) Thu, 09 Aug 2018 01:46:15 -0700


    [ 
https://issues.apache.org/jira/browse/LANG-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16574490#comment-16574490
 ]


ASF GitHub Bot commented on LANG-1406:
--------------------------------------

Github user kinow commented on the issue:

    https://github.com/apache/commons-lang/pull/340
  
    Oh, that does make sense now. So the first visible character we see is the 
["Latin Capital Letter I with Dot Above"](https://unicode-table.com/en/#0130) 
(see also [this other 
link](https://en.wikipedia.org/wiki/Dotted_and_dotless_I)), and the second an 
`x`. And doing `toUpperCase()` simply won't change it as it's considered 
already upper case.
    
    When doing a `toLowerCase`, it gets translated into two visible characters. 
The second is the normal `x`, while the first is made up of two code points. I 
tested in Python and got the lower case `i` (`print(u"\u0069")`) followed by a 
character that is invisible by itself (`print(u"\u0307")`).
    
    That special character (U+0307, Combining Dot Above) is invisible on its 
own, but becomes visible when it comes after certain letters.
    
    ```python
    >>> print(u"\u0307")
    
    >>> print(u"\u0069\u0307")
    i̇
    >>> print(u"\u0068\u0307")
    ḣ
    >>> print(u"\u0067\u0307")
    ġ
    ```
    
    Because the lower-cased string has one more code point than the original, 
the length returned is 3 rather than 2, which results in the exception in this 
issue.
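
    The length mismatch can be reproduced in a few lines. This is only a 
sketch, assuming Python 3's `str.lower()` applies the same locale-independent 
mapping for U+0130 that Java uses outside Turkish locales; the variable names 
are made up for illustration.

```python
# Assumption: Python 3 str.lower() stands in for Java's toLowerCase() here.
text = "\u0130x"          # "Latin Capital Letter I with Dot Above" followed by "x"
lowered = text.lower()    # becomes "i" + U+0307 + "x"

print(len(text))                        # 2
print(len(lowered))                     # 3
print([hex(ord(c)) for c in lowered])   # ['0x69', '0x307', '0x78']

# The position of "x" differs between the two strings, so an index found in
# the lower-cased copy is off by one when applied to the original.
idx_in_lowered = lowered.find("x")      # 2
idx_in_original = text.find("x")        # 1
```

    An index of 2 is past the last valid position of the 2-character original, 
which is the kind of offset that ends in an out-of-bounds error.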
    
    I don't believe the fix here would cover the reverse case, where a lower 
case character made of a single code point is represented by two code points 
once converted to upper case. The exception could happen again (I haven't 
investigated whether such a case exists, but I'm assuming there could be one - 
if not now, a character could still be added in future editions).
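
    The reverse direction can be checked the same way. In Python 3 at least, 
`ß` (U+00DF) and the `ﬁ` ligature (U+FB01) are each a single code point in 
lower case but expand when upper-cased. Whether the Java method actually hits 
them depends on which case conversion it ends up using, so treat this as an 
illustration rather than a confirmed reproduction.

```python
# Lower-case characters whose upper-case form is longer than one code point.
sharp_s = "\u00df"    # ß
fi_ligature = "\ufb01"  # ﬁ

print(sharp_s.upper(), len(sharp_s), len(sharp_s.upper()))              # SS 1 2
print(fi_ligature.upper(), len(fi_ligature), len(fi_ligature.upper()))  # FI 1 2

# Same off-by-one pattern as before, this time caused by upper-casing.
text = "\u00dfx"
upper_len = len(text.upper())   # 3, while len(text) is 2
```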
    
    What do you think @HiuKwok? Any suggestions? I'm not sure if there's any 
easy way to fix this case, except by adding a note to the documentation saying 
that the method is not intended to be used with Unicode strings whose length 
changes under case conversion. Or maybe we could try to remove the 
`length()` call around the `StringBuilder`s near the end of the method...
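
    One way to avoid the problem altogether is to never take indices from a 
case-converted copy. The sketch below is not the commons-lang code and not the 
change in this pull request; `replace_ignore_case` is a hypothetical helper 
that compares fixed-width windows of the original text, so every index refers 
to the original string.

```python
def replace_ignore_case(text, search, replacement):
    """Hypothetical helper: case-insensitive replace using original indices only."""
    if not text or not search:
        return text
    folded = search.casefold()
    n = len(search)
    out = []
    i = 0
    while i < len(text):
        window = text[i:i + n]
        # Compare a window of the ORIGINAL text; only full-width windows count.
        if len(window) == n and window.casefold() == folded:
            out.append(replacement)
            i += n
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


result = replace_ignore_case("\u0130x", "x", "")
print(result == "\u0130")   # True
```

    The trade-off is that a match whose case-folded form has a different length 
from the search string (for example `ß` against `SS`) is not found, and the 
scan is slower than searching a pre-converted copy.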


> StringIndexOutOfBoundsException in StringUtils.replaceIgnoreCase
> ----------------------------------------------------------------
>
>                 Key: LANG-1406
>                 URL: https://issues.apache.org/jira/browse/LANG-1406
>             Project: Commons Lang
>          Issue Type: Bug
>          Components: lang.*
>            Reporter: Michael Ryan
>            Priority: Major
>
> STEPS TO REPRODUCE:
> {code}
> StringUtils.replaceIgnoreCase("\u0130x", "x", "")
> {code}
> EXPECTED: "\u0130" is returned.
> ACTUAL: StringIndexOutOfBoundsException
> This happens because the replace method is assuming that text.length() == 
> text.toLowerCase().length(), which is not true for certain characters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
