[ https://issues.apache.org/jira/browse/TEXT-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16817329#comment-16817329 ]
Rob Tompkins commented on TEXT-161:
-----------------------------------
Looks like something up my alley
> Should there be a better implementation of substring that deals with Unicode
> surrogate pairs correctly?
> -------------------------------------------------------------------------------------------------------
>
> Key: TEXT-161
> URL: https://issues.apache.org/jira/browse/TEXT-161
> Project: Commons Text
> Issue Type: New Feature
> Affects Versions: 1.6
> Environment: Any
> Reporter: Sebastiaan
> Assignee: Rob Tompkins
> Priority: Minor
> Labels: features
>
> Java's String.substring indexes by char (UTF-16 code unit), so it can cut a
> surrogate pair in half and return a string containing an unpaired surrogate.
> For a brief overview, read this blog post:
> [https://codeahoy.com/2016/05/08/the-char-type-in-java-is-broken/]
>
> I have some demo code showing the issues and a possible solution here:
> {code:java}
> public class SubstringTest {
>     public static void main(String[] args) {
>         // Each emoji lies outside the BMP, so it occupies two chars (a surrogate pair).
>         String stringWithPlus2ByteCodePoints = "👦👩👪👫";
>         // char-based indices: (0, 1) and (1, 3) cut through surrogate pairs;
>         // (0, 2) happens to end on a pair boundary and yields one whole emoji.
>         String substring1 = stringWithPlus2ByteCodePoints.substring(0, 1);
>         String substring2 = stringWithPlus2ByteCodePoints.substring(0, 2);
>         String substring3 = stringWithPlus2ByteCodePoints.substring(1, 3);
>         System.out.println(stringWithPlus2ByteCodePoints);
>         System.out.println("invalid sub: " + substring1);
>         System.out.println("invalid sub: " + substring2);
>         System.out.println("invalid sub: " + substring3);
>         // Code-point-based indices: the same arguments now count whole code points.
>         String realSub1 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 1);
>         String realSub2 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 2);
>         String realSub3 = getRealSubstring(stringWithPlus2ByteCodePoints, 1, 3);
>         System.out.println("real sub: " + realSub1);
>         System.out.println("real sub: " + realSub2);
>         System.out.println("real sub: " + realSub3);
>     }
>
>     private static String getRealSubstring(String string, int beginIndex, int endIndex) {
>         if (string == null)
>             throw new IllegalArgumentException("String should not be null");
>         // The indices are code point indices, so validate them against the
>         // code point count rather than the char length.
>         int codePointCount = string.codePointCount(0, string.length());
>         if (beginIndex < 0 || beginIndex > endIndex || endIndex > codePointCount)
>             throw new IllegalArgumentException("Invalid indices");
>         // Translate code point indices into char indices before calling substring.
>         int realBeginIndex = string.offsetByCodePoints(0, beginIndex);
>         int realEndIndex = string.offsetByCodePoints(0, endIndex);
>         return string.substring(realBeginIndex, realEndIndex);
>     }
> }{code}
> The output is:
> {noformat}
> 👦👩👪👫
> invalid sub: ?
> invalid sub: 👦
> invalid sub: ??
> real sub: 👦
> real sub: 👦👩
> real sub: 👩👪{noformat}
>
> The same issues appear in Apache Commons Text's substring method.
> Should Apache Commons Text use this code, or something similar, in its
> substring implementation rather than relying on the char-based Java substring
> method? Or should it at least offer an additional utility method that correctly
> takes substrings of strings whose Unicode code points require surrogate pairs?
>
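For discussion, here is a minimal sketch of what the "additional utility method" asked about above might look like, assuming its indices are counted in code points rather than chars. The class name CodePointSubstring and its substring method are hypothetical and are not part of Commons Text; the sketch relies only on standard java.lang.String methods (codePointCount, offsetByCodePoints, substring). Throwing IndexOutOfBoundsException for bad indices is a design assumption chosen to mirror String.substring.

```java
// Hypothetical helper for illustration only; not an existing Commons Text class.
final class CodePointSubstring {

    private CodePointSubstring() {
        // static utility, no instances
    }

    /**
     * Returns the part of text from beginIndex (inclusive) to endIndex
     * (exclusive), where both indices count Unicode code points, not chars.
     */
    static String substring(String text, int beginIndex, int endIndex) {
        if (text == null) {
            throw new IllegalArgumentException("text must not be null");
        }
        // Number of code points; a surrogate pair counts as one.
        int codePoints = text.codePointCount(0, text.length());
        if (beginIndex < 0 || beginIndex > endIndex || endIndex > codePoints) {
            throw new IndexOutOfBoundsException(
                    "begin " + beginIndex + ", end " + endIndex + ", code points " + codePoints);
        }
        // Convert code point indices to char indices. The end offset is
        // computed from the begin offset so the prefix is not scanned twice.
        int begin = text.offsetByCodePoints(0, beginIndex);
        int end = text.offsetByCodePoints(begin, endIndex - beginIndex);
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        // The four emoji from the issue, written as escaped surrogate pairs.
        String family = "\uD83D\uDC66\uD83D\uDC69\uD83D\uDC6A\uD83D\uDC6B";
        System.out.println(family.length());                           // char count
        System.out.println(family.codePointCount(0, family.length())); // code point count
        System.out.println(substring(family, 1, 3));                   // second and third emoji
    }
}
```

This only keeps surrogate pairs intact. Text that a user perceives as a single character but that consists of several code points (combining marks, emoji joined with zero-width joiners) can still be split, and handling that would need grapheme cluster segmentation instead.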
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)