[ https://issues.apache.org/jira/browse/LANG-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830274#comment-16830274 ]
Rob Tompkins commented on LANG-1451:
------------------------------------

Actually... if we make our {{substring}} method the following:
{code:java}
public static String substring(final String str, int start, int end) {
    if (str == null) {
        return null;
    }

    // start and end are code point indices, so bound them by the number of
    // code points rather than by str.length(), which counts UTF-16 code units
    final int codePoints = str.codePointCount(0, str.length());

    // handle negatives
    if (end < 0) {
        end = codePoints + end; // remember end is negative
    }
    if (start < 0) {
        start = codePoints + start; // remember start is negative
    }

    // check length next
    if (end > codePoints) {
        end = codePoints;
    }

    // if start is greater than end, return ""
    if (start > end) {
        return EMPTY;
    }

    if (start < 0) {
        start = 0;
    }
    if (end < 0) {
        end = 0;
    }

    // translate code point indices into UTF-16 offsets
    int realUtf16Start = str.offsetByCodePoints(0, start);
    int realUtf16End = str.offsetByCodePoints(0, end);
    return str.substring(realUtf16Start, realUtf16End);
}
{code}
we're just good. I think this is likely the best path.

> Should there be a better implementation of substring that deals with Unicode
> surrogate pairs correctly?
> -------------------------------------------------------------------------------------------------------
>
>                 Key: LANG-1451
>                 URL: https://issues.apache.org/jira/browse/LANG-1451
>             Project: Commons Lang
>          Issue Type: New Feature
>    Affects Versions: 3.9
>         Environment: Any
>            Reporter: Sebastiaan
>            Assignee: Rob Tompkins
>            Priority: Minor
>              Labels: features
>
> There are some major problems with Java's substring implementation which
> works using chars.
> For a brief overview read this blog post:
> [https://codeahoy.com/2016/05/08/the-char-type-in-java-is-broken/]
>
> I have some demo code showing the issues and a possible solution here:
> {code:java}
> public class SubstringTest {
>     public static void main(String[] args) {
>         String stringWithPlus2ByteCodePoints = "👦👩👪👫";
>         String substring1 = stringWithPlus2ByteCodePoints.substring(0, 1);
>         String substring2 = stringWithPlus2ByteCodePoints.substring(0, 2);
>         String substring3 = stringWithPlus2ByteCodePoints.substring(1, 3);
>         System.out.println(stringWithPlus2ByteCodePoints);
>         System.out.println("invalid sub: " + substring1);
>         System.out.println("invalid sub: " + substring2);
>         System.out.println("invalid sub: " + substring3);
>         String realSub1 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 1);
>         String realSub2 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 2);
>         String realSub3 = getRealSubstring(stringWithPlus2ByteCodePoints, 1, 3);
>         System.out.println("real sub: " + realSub1);
>         System.out.println("real sub: " + realSub2);
>         System.out.println("real sub: " + realSub3);
>     }
>
>     private static String getRealSubstring(String string, int beginIndex, int endIndex) {
>         if (string == null)
>             throw new IllegalArgumentException("String should not be null");
>         int length = string.length();
>         if (endIndex < 0 || beginIndex > endIndex || beginIndex > length || endIndex > length)
>             throw new IllegalArgumentException("Invalid indices");
>         int realBeginIndex = string.offsetByCodePoints(0, beginIndex);
>         int realEndIndex = string.offsetByCodePoints(0, endIndex);
>         return string.substring(realBeginIndex, realEndIndex);
>     }
> }{code}
> The output is:
> {noformat}
> 👦👩👪👫
> invalid sub: ?
> invalid sub: 👦
> invalid sub: ??
> real sub: 👦
> real sub: 👦👩
> real sub: 👩👪{noformat}
>
> The same issues appear in Apache Commons Text's substring method.
> Should Apache Commons Text use this code or something similar in the
> substring implementation, rather than the flawed Java substring method? Or at
> least offer an additional utility method that does take a string with unicode
> codepoints that require surrogate pairs and substrings it correctly?
> I also posted my implementation at
> [https://stackoverflow.com/questions/55663213/java-substring-by-code-point-indices-treating-pairs-of-surrogate-code-units-as/]
> asking for advice and there is a more robust version as an answer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
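
A note on the char-versus-code-point distinction discussed in this issue: for the sample string, {{length()}} reports 8 UTF-16 code units while there are only 4 code points, so indices meant as code point positions have to be bounded by {{codePointCount}} before they are passed to {{offsetByCodePoints}}. The following is a minimal, self-contained sketch of that idea. The class name {{CodePointIndexDemo}} and the helper {{substringByCodePoints}} are hypothetical names chosen for illustration, not part of Commons Lang or Commons Text, and the clamping behaviour is an assumption rather than anything decided in this issue.

```java
public class CodePointIndexDemo {

    // Hypothetical helper for illustration only (not a Commons Lang API).
    // start and end are code point indices; out-of-range values are clamped.
    public static String substringByCodePoints(String s, int start, int end) {
        // Number of code points, which can be smaller than s.length()
        int count = s.codePointCount(0, s.length());
        int from = Math.max(0, Math.min(start, count));
        int to = Math.max(from, Math.min(end, count));
        // Translate code point positions into UTF-16 offsets
        int utf16From = s.offsetByCodePoints(0, from);
        int utf16To = s.offsetByCodePoints(utf16From, to - from);
        return s.substring(utf16From, utf16To);
    }

    public static void main(String[] args) {
        // The four emoji from the issue, written as surrogate pairs
        String s = "\uD83D\uDC66\uD83D\uDC69\uD83D\uDC6A\uD83D\uDC6B";
        System.out.println(s.length());                      // UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // code points
        // String.substring(0, 1) cuts a surrogate pair in half
        System.out.println(Character.isHighSurrogate(s.substring(0, 1).charAt(0)));
        // The code point based helper returns whole characters
        System.out.println(substringByCodePoints(s, 1, 3));
    }
}
```

Computing the end offset relative to the already translated start offset avoids scanning the string from the beginning a second time; whether clamping or throwing is the better contract for out-of-range indices is left open here.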