[ https://issues.apache.org/jira/browse/LANG-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830262#comment-16830262 ]

Gary Gregory commented on LANG-1451:
------------------------------------

Or we could have a new class {{CodePointStringUtils}} (or another name) to make it clear that the domain of the functions is a different kind of string than char[].

> Should there be a better implementation of substring that deals with Unicode surrogate pairs correctly?
> -------------------------------------------------------------------------------------------------------
>
>                 Key: LANG-1451
>                 URL: https://issues.apache.org/jira/browse/LANG-1451
>             Project: Commons Lang
>          Issue Type: New Feature
>    Affects Versions: 3.9
>         Environment: Any
>            Reporter: Sebastiaan
>            Assignee: Rob Tompkins
>            Priority: Minor
>              Labels: features
>
> There are some major problems with Java's substring implementation, which
> works using chars. For a brief overview read this blog post:
> [https://codeahoy.com/2016/05/08/the-char-type-in-java-is-broken/]
>
> I have some demo code showing the issues and a possible solution here:
> {code:java}
> public class SubstringTest {
>     public static void main(String[] args) {
>         String stringWithPlus2ByteCodePoints = "👦👩👪👫";
>         String substring1 = stringWithPlus2ByteCodePoints.substring(0, 1);
>         String substring2 = stringWithPlus2ByteCodePoints.substring(0, 2);
>         String substring3 = stringWithPlus2ByteCodePoints.substring(1, 3);
>         System.out.println(stringWithPlus2ByteCodePoints);
>         System.out.println("invalid sub: " + substring1);
>         System.out.println("invalid sub: " + substring2);
>         System.out.println("invalid sub: " + substring3);
>         String realSub1 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 1);
>         String realSub2 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 2);
>         String realSub3 = getRealSubstring(stringWithPlus2ByteCodePoints, 1, 3);
>         System.out.println("real sub: " + realSub1);
>         System.out.println("real sub: " + realSub2);
>         System.out.println("real sub: " + realSub3);
>     }
>
>     private static String getRealSubstring(String string, int beginIndex, int endIndex) {
>         if (string == null)
>             throw new IllegalArgumentException("String should not be null");
>         int length = string.length();
>         if (endIndex < 0 || beginIndex > endIndex || beginIndex > length || endIndex > length)
>             throw new IllegalArgumentException("Invalid indices");
>         int realBeginIndex = string.offsetByCodePoints(0, beginIndex);
>         int realEndIndex = string.offsetByCodePoints(0, endIndex);
>         return string.substring(realBeginIndex, realEndIndex);
>     }
> }{code}
> The output is:
> {noformat}
> 👦👩👪👫
> invalid sub: ?
> invalid sub: 👦
> invalid sub: ??
> real sub: 👦
> real sub: 👦👩
> real sub: 👩👪{noformat}
>
> The same issues appear in Apache Commons Text's substring method.
> Should Apache Commons Text use this code or something similar in the
> substring implementation, rather than the flawed Java substring method? Or at
> least offer an additional utility method that takes a string with Unicode
> code points that require surrogate pairs and substrings it correctly?
> I also posted my implementation at
> [https://stackoverflow.com/questions/55663213/java-substring-by-code-point-indices-treating-pairs-of-surrogate-code-units-as/]
> asking for advice, and there is a more robust version as an answer.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
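
For illustration only, here is a minimal sketch of a substring that takes code point indices, as discussed in this issue. The class name {{CodePointSubstring}} is hypothetical and is not part of Commons Lang or Commons Text, and this is not the Stack Overflow answer referenced above. The sketch assumes the indices should be validated against the number of code points rather than against String.length(), and it uses only the standard String.codePointCount and String.offsetByCodePoints methods.

```java
// Hypothetical helper, not an existing Commons Lang / Commons Text API.
final class CodePointSubstring {

    private CodePointSubstring() {
        // static utility, no instances
    }

    /**
     * Returns the substring between two code point indices
     * (beginIndex inclusive, endIndex exclusive).
     */
    static String substring(String string, int beginIndex, int endIndex) {
        if (string == null) {
            throw new IllegalArgumentException("String should not be null");
        }
        // Count code points, not chars: a surrogate pair counts once.
        int codePointLength = string.codePointCount(0, string.length());
        if (beginIndex < 0 || beginIndex > endIndex || endIndex > codePointLength) {
            throw new IndexOutOfBoundsException(
                "Invalid code point indices: " + beginIndex + ", " + endIndex);
        }
        // Translate code point indices into char (UTF-16 code unit) offsets.
        int charBegin = string.offsetByCodePoints(0, beginIndex);
        // Start the second scan at charBegin so the prefix is not walked twice.
        int charEnd = string.offsetByCodePoints(charBegin, endIndex - beginIndex);
        return string.substring(charBegin, charEnd);
    }
}
```

With the four-emoji string from the demo, substring(s, 0, 1) gives "👦", substring(s, 0, 2) gives "👦👩", and substring(s, 1, 3) gives "👩👪", matching the "real sub" lines in the reported output. An end index greater than the code point count (for example 5 for that string) is rejected, even though it is smaller than String.length(), which is 8.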