[jira] [Commented] (LANG-1451) Should there be a better implementation of substring that deals with Unicode surrogate pairs correctly?

Rob Tompkins (JIRA) Tue, 30 Apr 2019 05:30:25 -0700


    [ 
https://issues.apache.org/jira/browse/LANG-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830242#comment-16830242
 ]


Rob Tompkins commented on LANG-1451:
------------------------------------

[~Sebastiaan83] - when I try to instantiate a string like one of those 
above, I am unable to do so. Do you have some special settings that let you 
instantiate such strings? For example, I get the following:

{code:java}
System.out.println("👦👩👪👫"); // prints ????
{code}

The closest I could get to the intended output was to use Unicode escapes:

{code:java}
String PLUS_2_BYTE_CODE_POINTS = "\uD83D\uDC66\uD83D\uDC69\uD83D\uDC6A\uD83D\uDC6B";
{code}

Does that work for you?
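
One way to check that the escaped literal really holds the four intended characters, independent of how the console renders them, is to compare it against a string built from the code point values. A minimal sketch (the class name SurrogateEscapeCheck and the helper fromCodePoints are made up for illustration and are not part of Commons Lang):

```java
class SurrogateEscapeCheck {
    // Same value as the escaped literal above: four emoji, each stored as a surrogate pair.
    static final String ESCAPED = "\uD83D\uDC66\uD83D\uDC69\uD83D\uDC6A\uD83D\uDC6B";

    // Builds a string from code point values using the String(int[], int, int) constructor.
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String built = fromCodePoints(0x1F466, 0x1F469, 0x1F46A, 0x1F46B);
        // true if the escapes encode U+1F466, U+1F469, U+1F46A and U+1F46B
        System.out.println(ESCAPED.equals(built));
        // Number of UTF-16 code units (chars)
        System.out.println(ESCAPED.length());
        // Number of Unicode code points
        System.out.println(ESCAPED.codePointCount(0, ESCAPED.length()));
    }
}
```

If the comparison prints true, the escaped literal itself is fine, and a `????` on the console is more likely down to the source file or console encoding than to the string. That is an assumption on my part and I have not confirmed it for this setup.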

> Should there be a better implementation of substring that deals with Unicode 
> surrogate pairs correctly?
> -------------------------------------------------------------------------------------------------------
>
>                 Key: LANG-1451
>                 URL: https://issues.apache.org/jira/browse/LANG-1451
>             Project: Commons Lang
>          Issue Type: New Feature
>    Affects Versions: 3.9
>         Environment: Any
>            Reporter: Sebastiaan
>            Assignee: Rob Tompkins
>            Priority: Minor
>              Labels: features
>
> There are some major problems with Java's substring implementation which 
> works using chars. For a brief overview read this blog post: 
> [https://codeahoy.com/2016/05/08/the-char-type-in-java-is-broken/]
>  
> I have some demo code showing the issues and a possible solution here:
> {code:java}
> public class SubstringTest {
>     public static void main(String[] args) {
>         String stringWithPlus2ByteCodePoints = "👦👩👪👫";
>         String substring1 = stringWithPlus2ByteCodePoints.substring(0, 1);
>         String substring2 = stringWithPlus2ByteCodePoints.substring(0, 2);
>         String substring3 = stringWithPlus2ByteCodePoints.substring(1, 3);
>         System.out.println(stringWithPlus2ByteCodePoints);
>         System.out.println("invalid sub: " + substring1);
>         System.out.println("invalid sub: " + substring2);
>         System.out.println("invalid sub: " + substring3);
>         String realSub1 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 1);
>         String realSub2 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 2);
>         String realSub3 = getRealSubstring(stringWithPlus2ByteCodePoints, 1, 3);
>         System.out.println("real sub: " + realSub1);
>         System.out.println("real sub: " + realSub2);
>         System.out.println("real sub: " + realSub3);
>     }
>     private static String getRealSubstring(String string, int beginIndex, int endIndex) {
>         if (string == null)
>             throw new IllegalArgumentException("String should not be null");
>         // Indices are counted in code points, so validate against the code point count.
>         int length = string.codePointCount(0, string.length());
>         if (beginIndex < 0 || beginIndex > endIndex || endIndex > length)
>             throw new IllegalArgumentException("Invalid indices");
>         int realBeginIndex = string.offsetByCodePoints(0, beginIndex);
>         int realEndIndex = string.offsetByCodePoints(0, endIndex);
>         return string.substring(realBeginIndex, realEndIndex);
>     }
> }{code}
> The output is:
> {noformat}
> 👦👩👪👫
> invalid sub: ?
> invalid sub: 👦
> invalid sub: ??
> real sub: 👦
> real sub: 👦👩
> real sub: 👩👪{noformat}
>  
> The same issues appear in Apache Commons Text's substring method.
> Should Apache Commons Text use this code or something similar in its 
> substring implementation, rather than the flawed Java substring method? Or at 
> least offer an additional utility method that correctly takes substrings of 
> strings whose Unicode code points require surrogate pairs?
> I also posted my implementation at 
> [https://stackoverflow.com/questions/55663213/java-substring-by-code-point-indices-treating-pairs-of-surrogate-code-units-as/]
> asking for advice, and there is a more robust version as an answer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
