[jira] [Comment Edited] (LANG-1300) Clarify or improve behaviour of int-based methods in StringUtils

Rob Tompkins (JIRA) Tue, 07 Mar 2017 05:21:07 -0800

    [ 
https://issues.apache.org/jira/browse/LANG-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899421#comment-15899421
 ]


Rob Tompkins edited comment on LANG-1300 at 3/7/17 1:20 PM:
------------------------------------------------------------

Github user chtompki commented on the issue:

    https://github.com/apache/commons-lang/pull/251
  
    @dmjones500 - no worries about being busy, we all end up there from time to 
time... :-) 
    
    @dmjones500 has an interesting point. The problem seems to lie with the 
number of "Supplementary Code Points" preceding the *findable* `searchChar` 
that have been split into their corresponding surrogate pairs.  
    
    You may need to consider using `Character.isSurrogate(char ch)` as well as 
`Character.isSurrogatePair(char high, char low)` for all characters preceding 
our *findable* code point. Granted, that adds an *O(n)* multiplier on our 
method's efficiency, pushing us to *O(n<sup>2</sup>)*. It feels like only then 
can we be absolutely certain that we are not over-counting by using *code 
units* as opposed to *code points*. 
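
    A minimal sketch of such a scan, assuming we only need the first match in 
an arbitrary `CharSequence`. The class and method names here are hypothetical 
and not part of Commons Lang. This version walks the sequence once and treats 
each valid surrogate pair as a single code point:

```java
// Hypothetical helper for illustration only; not part of Commons Lang.
final class CodePointSearch {

    private CodePointSearch() {
    }

    /**
     * Returns the code unit index of the first occurrence of codePoint in seq,
     * or -1 if it does not occur.
     */
    static int indexOfCodePoint(final CharSequence seq, final int codePoint) {
        final int length = seq.length();
        int i = 0;
        while (i < length) {
            final char c = seq.charAt(i);
            int current = c;
            int width = 1;
            // Combine a high/low surrogate pair into one supplementary code point.
            if (i + 1 < length && Character.isSurrogatePair(c, seq.charAt(i + 1))) {
                current = Character.toCodePoint(c, seq.charAt(i + 1));
                width = 2;
            }
            if (current == codePoint) {
                return i;
            }
            i += width;
        }
        return -1;
    }

    /** Counts the code points (not code units) in seq before endIndex. */
    static int codePointsBefore(final CharSequence seq, final int endIndex) {
        return Character.codePointCount(seq, 0, endIndex);
    }
}
```

    For a `StringBuilder` holding U+2070E followed by `'a'`, 
`indexOfCodePoint(builder, 'a')` returns the code unit index 2, while 
`codePointsBefore(builder, 2)` returns 1.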
    
    If we do move in this direction, we should be quite clear in the javadoc 
that there is a notable performance reduction when operating outside the 
"Basic Multilingual Plane" (ref. [Oracle's Character 
documentation](https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#supplementary)).
    
    @PascalSchumacher - do you have any thoughts here as well?




> Clarify or improve behaviour of int-based methods in StringUtils
> ----------------------------------------------------------------
>
>                 Key: LANG-1300
>                 URL: https://issues.apache.org/jira/browse/LANG-1300
>             Project: Commons Lang
>          Issue Type: Improvement
>          Components: lang.*
>    Affects Versions: 3.5
>            Reporter: Duncan Jones
>            Priority: Minor
>             Fix For: Discussion
>
>
> The following methods use an {{int}} to represent a search character:
> {code:java}
> boolean contains(final CharSequence seq, final int searchChar)
> int indexOf(final CharSequence seq, final int searchChar)
> int indexOf(final CharSequence seq, final int searchChar, final int startPos)
> int lastIndexOf(final CharSequence seq, final int searchChar)
> int lastIndexOf(final CharSequence seq, final int searchChar, final int startPos)
> {code}
> When I see an {{int}} representing a character, I tend to assume the method 
> can handle supplementary characters. However, the current behaviour of these 
> methods depends upon whether the {{CharSequence}} is a {{String}} or not.
> {code:java}
> StringBuilder builder = new StringBuilder();
> builder.appendCodePoint(0x2070E);
> System.out.println(StringUtils.lastIndexOf(builder, 0x2070E)); // -1
> System.out.println(StringUtils.lastIndexOf(builder.toString(), 0x2070E)); // 0
> {code}
> The Javadoc for these methods is ambiguous on this point, stating:
> {quote}
> This method uses {{String.lastIndexOf(int)}} if possible.
> {quote}
> I think we should consider updating the {{CharSequenceUtils}} methods used by 
> this class to convert all {{CharSequence}} parameters to strings, enabling 
> full code point support. The docs could be updated to make this crystal clear.
> There is a question of whether this breaks backwards compatibility.
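
The conversion proposed in the description can be sketched as follows. This is 
an illustration under stated assumptions, not the actual `CharSequenceUtils` 
code: the class name is hypothetical, and it relies only on the JDK's 
`String.lastIndexOf(int)`, which accepts supplementary code points.

```java
// Hypothetical helper for illustration only; not the Commons Lang implementation.
final class ToStringDelegation {

    private ToStringDelegation() {
    }

    /**
     * Converts any CharSequence to a String first, so that the search
     * character is matched as a full code point by String.lastIndexOf(int).
     */
    static int lastIndexOf(final CharSequence seq, final int searchChar) {
        if (seq == null) {
            return -1;
        }
        return seq.toString().lastIndexOf(searchChar);
    }
}
```

With this approach a `StringBuilder` and its `toString()` value give the same 
answer, at the cost of copying the sequence on every call.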



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
