[ 
https://issues.apache.org/jira/browse/TEXT-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16816883#comment-16816883
 ] 

Gilles commented on TEXT-161:
-----------------------------

You should type your example within "code" markers:
{noformat}
{code}
    // Code here...
{code}
{noformat}


> Should there be a better implementation of substring that deals with Unicode 
> surrogate pairs correctly?
> -------------------------------------------------------------------------------------------------------
>
>                 Key: TEXT-161
>                 URL: https://issues.apache.org/jira/browse/TEXT-161
>             Project: Commons Text
>          Issue Type: New Feature
>    Affects Versions: 1.6
>         Environment: Any
>            Reporter: Sebastiaan
>            Priority: Minor
>              Labels: features
>
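> Java's String.substring works on char indices, that is, on UTF-16 code units. 
> A code point outside the Basic Multilingual Plane is stored as a surrogate 
> pair (two chars), so substring can cut a pair in half and return an invalid 
> string. For a brief overview, read this blog post: 
> [https://codeahoy.com/2016/05/08/the-char-type-in-java-is-broken/]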
>  
> I have some demo code showing the issues and a possible solution here:
>  
> public class SubstringTest {
>     public static void main(String[] args) {
>         String stringWithPlus2ByteCodePoints = "👦👩👪👫";
>         String substring1 = stringWithPlus2ByteCodePoints.substring(0, 1);
>         String substring2 = stringWithPlus2ByteCodePoints.substring(0, 2);
>         String substring3 = stringWithPlus2ByteCodePoints.substring(1, 3);
>         System.out.println(stringWithPlus2ByteCodePoints);
>         System.out.println("invalid sub: " + substring1);
>         System.out.println("invalid sub: " + substring2);
>         System.out.println("invalid sub: " + substring3);
>         String realSub1 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 1);
>         String realSub2 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 2);
>         String realSub3 = getRealSubstring(stringWithPlus2ByteCodePoints, 1, 3);
>         System.out.println("real sub: " + realSub1);
>         System.out.println("real sub: " + realSub2);
>         System.out.println("real sub: " + realSub3);
>     }
>
>     private static String getRealSubstring(String string, int beginIndex, int endIndex) {
>         if (string == null)
>             throw new IllegalArgumentException("String should not be null");
>         int length = string.length();
>         if (endIndex < 0 || beginIndex > endIndex || beginIndex > length || endIndex > length)
>             throw new IllegalArgumentException("Invalid indices");
>         int realBeginIndex = string.offsetByCodePoints(0, beginIndex);
>         int realEndIndex = string.offsetByCodePoints(0, endIndex);
>         return string.substring(realBeginIndex, realEndIndex);
>     }
> }
>  
> The same issues appear in Apache Commons Text's substring method.
> Should Apache Commons Text use this code, or something similar, in its 
> substring implementation rather than the flawed Java substring method? Or 
> should it at least offer an additional utility method that correctly takes 
> substrings of strings containing Unicode code points that require surrogate 
> pairs?
>  
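
One detail in the demo above may be worth a closer look: its bounds check compares the code point indices against string.length(), which counts chars, so an index past the last code point can pass the check and then fail inside offsetByCodePoints with an IndexOutOfBoundsException. The following is a minimal sketch of a variant that validates against the code point count instead. The class name CodePointSubstring and its method are hypothetical, chosen only for illustration; they are not part of Commons Text or the JDK. The emoji are written as \u escapes so the source does not depend on file encoding.

```java
// Hypothetical helper for illustration only; not an existing Commons Text API.
final class CodePointSubstring {

    private CodePointSubstring() {
    }

    /**
     * Returns the substring between two code point indices
     * (begin inclusive, end exclusive).
     */
    static String substring(String s, int beginCodePoint, int endCodePoint) {
        if (s == null) {
            throw new IllegalArgumentException("String should not be null");
        }
        // Validate against the number of code points, not the number of chars.
        int codePoints = s.codePointCount(0, s.length());
        if (beginCodePoint < 0 || beginCodePoint > endCodePoint || endCodePoint > codePoints) {
            throw new IndexOutOfBoundsException(
                    "begin " + beginCodePoint + ", end " + endCodePoint
                            + ", code points " + codePoints);
        }
        // Translate code point indices into char (UTF-16 code unit) offsets.
        int beginChar = s.offsetByCodePoints(0, beginCodePoint);
        // Continue from beginChar so the prefix is not scanned a second time.
        int endChar = s.offsetByCodePoints(beginChar, endCodePoint - beginCodePoint);
        return s.substring(beginChar, endChar);
    }

    public static void main(String[] args) {
        // The four emoji from the issue: U+1F466, U+1F469, U+1F46A, U+1F46B.
        String s = "\uD83D\uDC66\uD83D\uDC69\uD83D\uDC6A\uD83D\uDC6B";
        System.out.println(s.length());                      // 8 chars
        System.out.println(s.codePointCount(0, s.length())); // 4 code points
        System.out.println(substring(s, 1, 3));              // second and third emoji
    }
}
```

Like the demo, this sketch still scans the string from the start on every call, so it runs in time proportional to the string length, whereas String.substring looks up char offsets directly.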



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
