[jira] [Commented] (TEXT-161) Should there be a better implementation of substring that deals with Unicode surrogate pairs correctly?

Rob Tompkins (JIRA) Sun, 14 Apr 2019 08:01:47 -0700


    [ 
https://issues.apache.org/jira/browse/TEXT-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16817329#comment-16817329
 ]


Rob Tompkins commented on TEXT-161:
-----------------------------------

Looks like something up my alley.

> Should there be a better implementation of substring that deals with Unicode 
> surrogate pairs correctly?
> -------------------------------------------------------------------------------------------------------
>
>                 Key: TEXT-161
>                 URL: https://issues.apache.org/jira/browse/TEXT-161
>             Project: Commons Text
>          Issue Type: New Feature
>    Affects Versions: 1.6
>         Environment: Any
>            Reporter: Sebastiaan
>            Assignee: Rob Tompkins
>            Priority: Minor
>              Labels: features
>
> Java's String.substring indexes by char (UTF-16 code unit) rather than by 
> code point, so it can split a surrogate pair and return a malformed string. 
> For a brief overview, read this blog post: 
> [https://codeahoy.com/2016/05/08/the-char-type-in-java-is-broken/]
>  
> I have some demo code showing the issues and a possible solution here:
> {code:java}
> public class SubstringTest {
>     public static void main(String[] args) {
>         String stringWithPlus2ByteCodePoints = "👦👩👪👫";
>         String substring1 = stringWithPlus2ByteCodePoints.substring(0, 1);
>         String substring2 = stringWithPlus2ByteCodePoints.substring(0, 2);
>         String substring3 = stringWithPlus2ByteCodePoints.substring(1, 3);
>         System.out.println(stringWithPlus2ByteCodePoints);
>         System.out.println("invalid sub: " + substring1);
>         System.out.println("invalid sub: " + substring2);
>         System.out.println("invalid sub: " + substring3);
>         String realSub1 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 1);
>         String realSub2 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 2);
>         String realSub3 = getRealSubstring(stringWithPlus2ByteCodePoints, 1, 3);
>         System.out.println("real sub: " + realSub1);
>         System.out.println("real sub: " + realSub2);
>         System.out.println("real sub: " + realSub3);
>     }
>     // Substring by code point index (begin inclusive, end exclusive).
>     private static String getRealSubstring(String string, int beginIndex, int endIndex) {
>         if (string == null)
>             throw new IllegalArgumentException("String should not be null");
>         // Validate against the code point count: the indices are code points, not chars.
>         int length = string.codePointCount(0, string.length());
>         if (beginIndex < 0 || beginIndex > endIndex || endIndex > length)
>             throw new IllegalArgumentException("Invalid indices");
>         // Map code point indices to char offsets so that no surrogate pair is split.
>         int realBeginIndex = string.offsetByCodePoints(0, beginIndex);
>         int realEndIndex = string.offsetByCodePoints(0, endIndex);
>         return string.substring(realBeginIndex, realEndIndex);
>     }
> }{code}
> The output is:
> {noformat}
> 👦👩👪👫
> invalid sub: ?
> invalid sub: 👦
> invalid sub: ??
> real sub: 👦
> real sub: 👦👩
> real sub: 👩👪{noformat}
>  
> The same issues appear in Apache Commons Text's substring method.
> Should Apache Commons Text use this code, or something similar, in its 
> substring implementation rather than relying on Java's char-based substring? 
> Or should it at least offer an additional utility method that correctly 
> substrings strings containing Unicode code points that require surrogate pairs?
>  
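
For illustration, the "additional utility method" asked about above might look roughly like the sketch below. The class name CodePointSubstring and the method name substringByCodePoints are hypothetical and are not part of Commons Text or the JDK; only String.codePointCount, String.offsetByCodePoints and String.substring are real JDK calls. The choice of exception types is likewise an assumption.

```java
final class CodePointSubstring {

    private CodePointSubstring() {
        // Static utility holder (hypothetical), not meant to be instantiated.
    }

    /**
     * Returns the part of text between two code point indices
     * (begin inclusive, end exclusive) without splitting a surrogate pair.
     */
    static String substringByCodePoints(String text, int beginIndex, int endIndex) {
        if (text == null) {
            throw new IllegalArgumentException("text must not be null");
        }
        // Count code points, not chars: a surrogate pair counts once.
        int codePoints = text.codePointCount(0, text.length());
        if (beginIndex < 0 || beginIndex > endIndex || endIndex > codePoints) {
            throw new IndexOutOfBoundsException(
                    "begin " + beginIndex + ", end " + endIndex
                            + ", code points " + codePoints);
        }
        // Translate the code point indices into char (UTF-16 code unit) offsets.
        int begin = text.offsetByCodePoints(0, beginIndex);
        // Continue from begin so the prefix is not scanned a second time.
        int end = text.offsetByCodePoints(begin, endIndex - beginIndex);
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        // The four emoji from the demo above, written as surrogate pairs.
        String family = "\uD83D\uDC66\uD83D\uDC69\uD83D\uDC6A\uD83D\uDC6B";
        System.out.println(substringByCodePoints(family, 1, 3));
    }
}
```

Validating against the code point count up front means out-of-range indices are rejected with a single, descriptive exception rather than surfacing from offsetByCodePoints. Note that this only keeps surrogate pairs intact; it does not attempt to keep multi-code-point grapheme clusters (for example, emoji joined with zero-width joiners) together.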



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
