[
https://issues.apache.org/jira/browse/TEXT-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16816883#comment-16816883
]
Gilles commented on TEXT-161:
-----------------------------
You should type your example within "code" markers:
{noformat}
{code}
// Code here...
{code}
{noformat}
> Should there be a better implementation of substring that deals with Unicode
> surrogate pairs correctly?
> -------------------------------------------------------------------------------------------------------
>
> Key: TEXT-161
> URL: https://issues.apache.org/jira/browse/TEXT-161
> Project: Commons Text
> Issue Type: New Feature
> Affects Versions: 1.6
> Environment: Any
> Reporter: Sebastiaan
> Priority: Minor
> Labels: features
>
> Java's String.substring indexes by char (UTF-16 code unit), so it can cut a
> surrogate pair in half and return a malformed string. For a brief overview,
> read this blog post:
> [https://codeahoy.com/2016/05/08/the-char-type-in-java-is-broken/]
>
> I have some demo code showing the issues and a possible solution here:
>
> public class SubstringTest {
>     public static void main(String[] args) {
>         // Each of these emoji occupies two chars (a surrogate pair).
>         String stringWithPlus2ByteCodePoints = "👦👩👪👫";
>         // Char-based indices:
>         String substring1 = stringWithPlus2ByteCodePoints.substring(0, 1); // lone high surrogate
>         String substring2 = stringWithPlus2ByteCodePoints.substring(0, 2); // one code point, not two
>         String substring3 = stringWithPlus2ByteCodePoints.substring(1, 3); // low surrogate + high surrogate
>         System.out.println(stringWithPlus2ByteCodePoints);
>         System.out.println("invalid sub: " + substring1);
>         System.out.println("invalid sub: " + substring2);
>         System.out.println("invalid sub: " + substring3);
>         // Code-point-based indices:
>         String realSub1 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 1);
>         String realSub2 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 2);
>         String realSub3 = getRealSubstring(stringWithPlus2ByteCodePoints, 1, 3);
>         System.out.println("real sub: " + realSub1);
>         System.out.println("real sub: " + realSub2);
>         System.out.println("real sub: " + realSub3);
>     }
>
>     private static String getRealSubstring(String string, int beginIndex, int endIndex) {
>         if (string == null)
>             throw new IllegalArgumentException("String should not be null");
>         // Indices are in code points, so validate against the code point count.
>         int length = string.codePointCount(0, string.length());
>         if (beginIndex < 0 || beginIndex > endIndex || endIndex > length)
>             throw new IllegalArgumentException("Invalid indices");
>         // Translate code point indices into char indices.
>         int realBeginIndex = string.offsetByCodePoints(0, beginIndex);
>         int realEndIndex = string.offsetByCodePoints(realBeginIndex, endIndex - beginIndex);
>         return string.substring(realBeginIndex, realEndIndex);
>     }
> }
>
> The same issues appear in Apache Commons Text's substring method.
> Should Apache Commons Text use this code, or something similar, in its
> substring implementation rather than the char-based Java substring method? Or
> at least offer an additional utility method that accepts a string containing
> Unicode code points that require surrogate pairs and extracts the substring
> correctly?
>
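
As an illustrative aside to the description above, here is a minimal sketch of why a char-based cut can produce a malformed string: it reports whether a given char index falls between the two halves of a surrogate pair. The class and method names (SurrogateCheck, splitsSurrogatePair) are hypothetical and are not part of Commons Text or the JDK; only standard java.lang calls are used. The test string is built from code points to avoid depending on source-file encoding.

```java
public class SurrogateCheck {

    /**
     * Returns true if cutting {@code s} at char index {@code index} would
     * separate a high surrogate from the low surrogate that follows it.
     * (Hypothetical helper, for illustration only.)
     */
    public static boolean splitsSurrogatePair(String s, int index) {
        return index > 0
                && index < s.length()
                && Character.isHighSurrogate(s.charAt(index - 1))
                && Character.isLowSurrogate(s.charAt(index));
    }

    public static void main(String[] args) {
        // U+1F466, U+1F469, U+1F46A, U+1F46B: the four emoji from the demo,
        // each stored as two chars in a Java String.
        String s = new String(new int[] {0x1F466, 0x1F469, 0x1F46A, 0x1F46B}, 0, 4);

        System.out.println(s.length());                      // 8 chars
        System.out.println(s.codePointCount(0, s.length())); // 4 code points

        System.out.println(splitsSurrogatePair(s, 1)); // true: inside the first pair
        System.out.println(splitsSurrogatePair(s, 2)); // false: between two code points
        System.out.println(splitsSurrogatePair(s, 3)); // true: inside the second pair
    }
}
```

A check like this could serve as a guard in char-indexed code, whereas the code-point-indexed approach in the description avoids the problem by construction.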