[ https://issues.apache.org/jira/browse/LANG-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830242#comment-16830242 ]
Rob Tompkins commented on LANG-1451:
------------------------------------

[~Sebastiaan83] - when I try to instantiate a string like one of those above, I am unable to do so. Do you have some special settings that let you instantiate such strings? For example, I get the following:
{code:java}
System.out.println("👦👩👪👫"); // ????
{code}
The best I could do to get the output was the following:
{code:java}
String PLUS_2_BYTE_CODE_POINTS = "\uD83D\uDC66\uD83D\uDC69\uD83D\uDC6A\uD83D\uDC6B";
{code}
Does that work for you?

> Should there be a better implementation of substring that deals with Unicode
> surrogate pairs correctly?
> -------------------------------------------------------------------------------------------------------
>
>                 Key: LANG-1451
>                 URL: https://issues.apache.org/jira/browse/LANG-1451
>             Project: Commons Lang
>          Issue Type: New Feature
>    Affects Versions: 3.9
>         Environment: Any
>            Reporter: Sebastiaan
>            Assignee: Rob Tompkins
>            Priority: Minor
>              Labels: features
>
> There are some major problems with Java's substring implementation which
> works using chars.
> For a brief overview read this blog post:
> [https://codeahoy.com/2016/05/08/the-char-type-in-java-is-broken/]
>
> I have some demo code showing the issues and a possible solution here:
> {code:java}
> public class SubstringTest {
>     public static void main(String[] args) {
>         String stringWithPlus2ByteCodePoints = "👦👩👪👫";
>         String substring1 = stringWithPlus2ByteCodePoints.substring(0, 1);
>         String substring2 = stringWithPlus2ByteCodePoints.substring(0, 2);
>         String substring3 = stringWithPlus2ByteCodePoints.substring(1, 3);
>         System.out.println(stringWithPlus2ByteCodePoints);
>         System.out.println("invalid sub: " + substring1);
>         System.out.println("invalid sub: " + substring2);
>         System.out.println("invalid sub: " + substring3);
>         String realSub1 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 1);
>         String realSub2 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 2);
>         String realSub3 = getRealSubstring(stringWithPlus2ByteCodePoints, 1, 3);
>         System.out.println("real sub: " + realSub1);
>         System.out.println("real sub: " + realSub2);
>         System.out.println("real sub: " + realSub3);
>     }
>     private static String getRealSubstring(String string, int beginIndex, int endIndex) {
>         if (string == null)
>             throw new IllegalArgumentException("String should not be null");
>         int length = string.length();
>         if (endIndex < 0 || beginIndex > endIndex || beginIndex > length || endIndex > length)
>             throw new IllegalArgumentException("Invalid indices");
>         int realBeginIndex = string.offsetByCodePoints(0, beginIndex);
>         int realEndIndex = string.offsetByCodePoints(0, endIndex);
>         return string.substring(realBeginIndex, realEndIndex);
>     }
> }{code}
> The output is:
> {noformat}
> 👦👩👪👫
> invalid sub: ?
> invalid sub: 👦
> invalid sub: ??
> real sub: 👦
> real sub: 👦👩
> real sub: 👩👪{noformat}
>
> The same issues appear in Apache Commons Text's substring method.
> Should Apache Commons Text use this code or something similar in the
> substring implementation, rather than the flawed Java substring method? Or at
> least offer an additional utility method that takes a string with Unicode
> code points that require surrogate pairs and substrings it correctly?
>
> I also posted my implementation at
> [https://stackoverflow.com/questions/55663213/java-substring-by-code-point-indices-treating-pairs-of-surrogate-code-units-as/]
> asking for advice, and there is a more robust version as an answer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
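
As an illustration of the code-point-based approach discussed in this issue, here is a minimal sketch that validates the indices in code points rather than in chars. The class name `CodePointSubstring` and its `substring` method are hypothetical and are not part of Commons Lang or Commons Text; the sketch relies only on the JDK methods `String.codePointCount` and `String.offsetByCodePoints`. The test string is written with `\u` escapes so that the result does not depend on the source file encoding.

```java
public final class CodePointSubstring {

    private CodePointSubstring() {
        // static helper only (hypothetical class, for illustration)
    }

    /**
     * Returns the part of s between two code point indices:
     * beginIndex inclusive, endIndex exclusive.
     */
    public static String substring(String s, int beginIndex, int endIndex) {
        if (s == null) {
            throw new IllegalArgumentException("String should not be null");
        }
        // Count code points, not chars, so the bounds check matches the index unit.
        int codePoints = s.codePointCount(0, s.length());
        if (beginIndex < 0 || beginIndex > endIndex || endIndex > codePoints) {
            throw new IndexOutOfBoundsException(
                "begin " + beginIndex + ", end " + endIndex + ", code points " + codePoints);
        }
        // Translate code point indices into char indices.
        int begin = s.offsetByCodePoints(0, beginIndex);
        // Continue from begin instead of rescanning from the start of the string.
        int end = s.offsetByCodePoints(begin, endIndex - beginIndex);
        return s.substring(begin, end);
    }

    public static void main(String[] args) {
        // Four code points outside the BMP, each stored as a surrogate pair.
        String family = "\uD83D\uDC66\uD83D\uDC69\uD83D\uDC6A\uD83D\uDC6B";
        System.out.println(family.length());                            // 8
        System.out.println(family.codePointCount(0, family.length())); // 4
        System.out.println(
            substring(family, 1, 3).equals("\uD83D\uDC69\uD83D\uDC6A")); // true
    }
}
```

Unlike the demo in the issue description, this variant rejects a negative `beginIndex` up front and compares `endIndex` against the code point count, so out-of-range arguments fail with a single, predictable exception type. An unpaired surrogate still counts as one code point, which is how the JDK methods behave.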