[jira] [Created] (TEXT-161) Should there be a better implementation of substring that deals with Unicode surrogate pairs correctly?

Sebastiaan (JIRA) Sat, 13 Apr 2019 01:51:52 -0700

Sebastiaan created TEXT-161:
-------------------------------

             Summary: Should there be a better implementation of substring that 
deals with Unicode surrogate pairs correctly?
                 Key: TEXT-161
                 URL: https://issues.apache.org/jira/browse/TEXT-161
             Project: Commons Text
          Issue Type: New Feature
    Affects Versions: 1.6
         Environment: Any
            Reporter: Sebastiaan



Java's String.substring indexes by char (UTF-16 code unit) rather than by 
Unicode code point, so it can cut a surrogate pair in half and return a 
malformed string. For a brief overview, read this blog post: 
[https://codeahoy.com/2016/05/08/the-char-type-in-java-is-broken/]
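
To make the char versus code point distinction concrete, here is a minimal sketch. The class and method names are made up for illustration and are not part of any library; only standard java.lang calls are used.

```java
// Illustrative only: CharVsCodePoint is a hypothetical name for this sketch.
class CharVsCodePoint {

    // U+1F466 (BOY) lies outside the Basic Multilingual Plane,
    // so UTF-16 stores it as a surrogate pair (two chars).
    static final String BOY = new String(Character.toChars(0x1F466));

    // Number of UTF-16 code units; this is the unit String.substring indexes by.
    static int charLength(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // True if the first char is a low surrogate or the last char is a
    // high surrogate, i.e. the string was cut in the middle of a pair.
    static boolean hasDanglingSurrogate(String s) {
        if (s.isEmpty()) {
            return false;
        }
        return Character.isLowSurrogate(s.charAt(0))
                || Character.isHighSurrogate(s.charAt(s.length() - 1));
    }

    public static void main(String[] args) {
        System.out.println(charLength(BOY));                           // 2
        System.out.println(codePointLength(BOY));                      // 1
        System.out.println(hasDanglingSurrogate(BOY.substring(0, 1))); // true
        System.out.println(hasDanglingSurrogate(BOY.substring(0, 2))); // false
    }
}
```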

 

I have some demo code showing the issues and a possible solution here:

 

public class SubstringTest {
    public static void main(String[] args) {

        // Four emoji; each is one code point stored as a surrogate pair (2 chars).
        String stringWithPlus2ByteCodePoints = "👦👩👪👫";

        // String.substring takes char (UTF-16 code unit) indices.
        // (0, 1): lone high surrogate, half of the first emoji.
        String substring1 = stringWithPlus2ByteCodePoints.substring(0, 1);
        // (0, 2): one whole emoji, not the two a caller might expect.
        String substring2 = stringWithPlus2ByteCodePoints.substring(0, 2);
        // (1, 3): low surrogate of the first emoji + high surrogate of the second.
        String substring3 = stringWithPlus2ByteCodePoints.substring(1, 3);

        System.out.println(stringWithPlus2ByteCodePoints);
        System.out.println("invalid sub: " + substring1);
        System.out.println("invalid sub: " + substring2);
        System.out.println("invalid sub: " + substring3);

        String realSub1 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 1);
        String realSub2 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 2);
        String realSub3 = getRealSubstring(stringWithPlus2ByteCodePoints, 1, 3);
        System.out.println("real sub: " + realSub1);
        System.out.println("real sub: " + realSub2);
        System.out.println("real sub: " + realSub3);
    }

    // Substring by code point indices instead of char indices.
    private static String getRealSubstring(String string, int beginIndex, int endIndex) {
        if (string == null)
            throw new IllegalArgumentException("String should not be null");
        // Bounds are checked against the number of code points, not string.length(),
        // because beginIndex and endIndex count code points.
        int length = string.codePointCount(0, string.length());
        if (beginIndex < 0 || beginIndex > endIndex || endIndex > length)
            throw new IllegalArgumentException("Invalid indices");
        // Translate the code point indices into char indices.
        int realBeginIndex = string.offsetByCodePoints(0, beginIndex);
        int realEndIndex = string.offsetByCodePoints(realBeginIndex, endIndex - beginIndex);
        return string.substring(realBeginIndex, realEndIndex);
    }
}

 

The same issues appear in Apache Commons Text's substring method.

Should Apache Commons Text use this code or something similar in its substring 
implementation, rather than the flawed Java substring method? Or should it at 
least offer an additional utility method that correctly takes substrings of 
strings whose Unicode code points require surrogate pairs?
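
As one possible shape for such an additional utility method, here is a hedged sketch of an alternative that walks the code points with String.codePoints() (Java 8+). The class and method names are hypothetical and nothing here is existing Commons Text API. Unlike String.substring, this sketch silently truncates when endIndex runs past the end of the string, which may or may not be the desired contract.

```java
// Hypothetical helper; not part of Apache Commons Text or the JDK.
class CodePointSubstringSketch {

    // Returns the code points in [beginIndex, endIndex), where both
    // indices count code points rather than chars.
    static String substring(String s, int beginIndex, int endIndex) {
        if (s == null) {
            throw new IllegalArgumentException("String should not be null");
        }
        if (beginIndex < 0 || beginIndex > endIndex) {
            throw new IllegalArgumentException("Invalid indices");
        }
        return s.codePoints()                  // IntStream of code points
                .skip(beginIndex)              // drop everything before beginIndex
                .limit(endIndex - beginIndex)  // keep at most the requested count
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        String boy = new String(Character.toChars(0x1F466));   // U+1F466
        String woman = new String(Character.toChars(0x1F469)); // U+1F469
        String mixed = "a" + boy + "b" + woman;                // 4 code points, 6 chars

        // Code points 1 and 2: the boy emoji followed by "b".
        System.out.println(substring(mixed, 1, 3));
    }
}
```

Compared with offsetByCodePoints, the stream version allocates a builder and copies code points one at a time, so it is likely slower on long strings; it is shown only to illustrate that the indices can be defined purely in terms of code points.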

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
