[jira] [Commented] (LANG-1451) Should there be a better implementation of substring that deals with Unicode surrogate pairs correctly?

Rob Tompkins (JIRA) Tue, 30 Apr 2019 06:02:09 -0700


    [ 
https://issues.apache.org/jira/browse/LANG-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830274#comment-16830274
 ]


Rob Tompkins commented on LANG-1451:
------------------------------------

Actually, if we change our {{substring}} method to the following:

{code:java}
public static String substring(final String str, int start, int end) {
    if (str == null) {
        return null;
    }

    // start and end are code point indices, so measure the string in
    // code points rather than UTF-16 chars
    final int length = str.codePointCount(0, str.length());

    // handle negatives
    if (end < 0) {
        end = length + end; // remember end is negative
    }
    if (start < 0) {
        start = length + start; // remember start is negative
    }

    // check length next
    if (end > length) {
        end = length;
    }

    // if start is greater than end, return ""
    if (start > end) {
        return EMPTY;
    }

    if (start < 0) {
        start = 0;
    }
    if (end < 0) {
        end = 0;
    }

    // translate code point indices into UTF-16 char offsets
    final int realUtf16Start = str.offsetByCodePoints(0, start);
    final int realUtf16End = str.offsetByCodePoints(0, end);
    return str.substring(realUtf16Start, realUtf16End);
}
{code}

then we should be good. I think this is likely the best path.
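
One detail worth checking in any method along these lines: {{offsetByCodePoints}} counts code points, while {{String.length()}} counts UTF-16 chars, so index bounds have to be expressed in code points. The following is only a minimal sketch to illustrate the difference; the class and helper names are hypothetical and not part of Commons Lang.

```java
public class CodePointBoundsSketch {

    // Hypothetical helper: number of Unicode code points in the string.
    public static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Hypothetical helper: true if offsetByCodePoints(0, index) accepts the index,
    // false if it throws IndexOutOfBoundsException.
    public static boolean isValidCodePointIndex(String s, int index) {
        try {
            s.offsetByCodePoints(0, index);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Four supplementary code points: U+1F466, U+1F469, U+1F46A, U+1F46B.
        String s = new String(new int[] {0x1F466, 0x1F469, 0x1F46A, 0x1F46B}, 0, 4);

        System.out.println("chars:       " + s.length());
        System.out.println("code points: " + codePointLength(s));

        // An index that is within the char length can still be past the last code point.
        System.out.println("index 4 ok:  " + isValidCodePointIndex(s, 4));
        System.out.println("index 6 ok:  " + isValidCodePointIndex(s, 6));
    }
}
```

For this string the char length is 8 and the code point count is 4, so an end index of 6 would pass a check against {{length()}} but still be rejected by {{offsetByCodePoints}}.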

> Should there be a better implementation of substring that deals with Unicode 
> surrogate pairs correctly?
> -------------------------------------------------------------------------------------------------------
>
>                 Key: LANG-1451
>                 URL: https://issues.apache.org/jira/browse/LANG-1451
>             Project: Commons Lang
>          Issue Type: New Feature
>    Affects Versions: 3.9
>         Environment: Any
>            Reporter: Sebastiaan
>            Assignee: Rob Tompkins
>            Priority: Minor
>              Labels: features
>
> There are some major problems with Java's substring implementation which 
> works using chars. For a brief overview read this blog post: 
> [https://codeahoy.com/2016/05/08/the-char-type-in-java-is-broken/]
>  
> I have some demo code showing the issues and a possible solution here:
> {code:java}
> public class SubstringTest {
>     public static void main(String[] args) {
>         String stringWithPlus2ByteCodePoints = "👦👩👪👫";
>         String substring1 = stringWithPlus2ByteCodePoints.substring(0, 1);
>         String substring2 = stringWithPlus2ByteCodePoints.substring(0, 2);
>         String substring3 = stringWithPlus2ByteCodePoints.substring(1, 3);
>         System.out.println(stringWithPlus2ByteCodePoints);
>         System.out.println("invalid sub: " + substring1);
>         System.out.println("invalid sub: " + substring2);
>         System.out.println("invalid sub: " + substring3);
>         String realSub1 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 1);
>         String realSub2 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 2);
>         String realSub3 = getRealSubstring(stringWithPlus2ByteCodePoints, 1, 3);
>         System.out.println("real sub: " + realSub1);
>         System.out.println("real sub: " + realSub2);
>         System.out.println("real sub: " + realSub3);
>     }
>     private static String getRealSubstring(String string, int beginIndex, int endIndex) {
>         if (string == null)
>             throw new IllegalArgumentException("String should not be null");
>         int length = string.codePointCount(0, string.length()); // indices are code points
>         if (beginIndex < 0 || beginIndex > endIndex || endIndex > length)
>             throw new IllegalArgumentException("Invalid indices");
>         int realBeginIndex = string.offsetByCodePoints(0, beginIndex);
>         int realEndIndex = string.offsetByCodePoints(0, endIndex);
>         return string.substring(realBeginIndex, realEndIndex);
>     }
> }{code}
> The output is:
> {noformat}
> 👦👩👪👫
> invalid sub: ?
> invalid sub: 👦
> invalid sub: ??
> real sub: 👦
> real sub: 👦👩
> real sub: 👩👪{noformat}
>  
> The same issues appear in Apache Commons Text's substring method.
> Should Apache Commons Text use this code or something similar in the 
> substring implementation, rather than the flawed Java substring method? Or at 
> least offer an additional utility method that takes a string containing 
> Unicode code points that require surrogate pairs and substrings it correctly?
>  I also posted my implementation at 
> [https://stackoverflow.com/questions/55663213/java-substring-by-code-point-indices-treating-pairs-of-surrogate-code-units-as/]
>  asking for advice and there is a more robust version as an answer.
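
In the demo output quoted above, the {{?}} results are what a lone surrogate typically prints as: {{substring(0, 1)}} and {{substring(1, 3)}} cut between the two halves of a surrogate pair. As a rough illustration (the class and method names are hypothetical), a char index can be tested for this condition as follows:

```java
public class SurrogateSplitSketch {

    // Hypothetical helper: true if cutting the string at the given char index
    // would separate a high surrogate from the low surrogate that follows it.
    public static boolean splitsSurrogatePair(String s, int charIndex) {
        return charIndex > 0
                && charIndex < s.length()
                && Character.isHighSurrogate(s.charAt(charIndex - 1))
                && Character.isLowSurrogate(s.charAt(charIndex));
    }

    public static void main(String[] args) {
        // Four supplementary code points: U+1F466, U+1F469, U+1F46A, U+1F46B.
        String s = new String(new int[] {0x1F466, 0x1F469, 0x1F46A, 0x1F46B}, 0, 4);

        // Odd char indices fall inside a pair; even ones fall on a code point boundary.
        for (int i = 0; i <= s.length(); i++) {
            System.out.println("cut at " + i + " splits a pair: " + splitsSurrogatePair(s, i));
        }
    }
}
```

A code point based substring never produces such a cut, because {{offsetByCodePoints}} starting from index 0 only returns offsets on code point boundaries.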



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
