[jira] [Comment Edited] (LANG-1770) StringUtils.abbreviate is not emoji aware, breaks surrogate pairs

Leela Venkatesh V (Jira) Fri, 19 Jun 2026 07:52:38 -0700


    [ 
https://issues.apache.org/jira/browse/LANG-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18090225#comment-18090225
 ]


Leela Venkatesh V edited comment on LANG-1770 at 6/19/26 2:51 PM:
------------------------------------------------------------------

Hi [~ggregory],

I've spent some time investigating the issue and comparing approaches used in 
other libraries.

From my research, the immediate problem appears to be that abbreviate() can 
return ill-formed Unicode strings by splitting UTF-16 surrogate pairs. This 
seems to be the root cause of the XML example discussed in the issue.
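
To make the failure mode concrete, here is a minimal sketch that uses only the 
JDK. The helper naiveTruncate() is a hypothetical stand-in for code-unit 
truncation built on String.substring; it is not the Commons Lang 
implementation, and hasUnpairedSurrogate() is likewise an illustrative name.

```java
final class SurrogateSplitDemo {

    /** Returns true if s contains a surrogate code unit that is not part of a well-formed pair. */
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a well-formed pair
                } else {
                    return true; // high surrogate with no low surrogate after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate with no high surrogate before it
            }
        }
        return false;
    }

    /** Hypothetical stand-in: keep the first maxUnits UTF-16 code units, then append "...". */
    static String naiveTruncate(String s, int maxUnits) {
        return s.length() <= maxUnits ? s : s.substring(0, maxUnits) + "...";
    }

    public static void main(String[] args) {
        // Three fox emoji (U+1F98A); each one occupies two UTF-16 code units.
        String foxes = "\uD83E\uDD8A\uD83E\uDD8A\uD83E\uDD8A";
        String cut = naiveTruncate(foxes, 3);
        System.out.println(hasUnpairedSurrogate(foxes));
        System.out.println(hasUnpairedSurrogate(cut));
    }
}
```

Cutting after three code units keeps the high surrogate of the second fox and 
drops its low surrogate, so the result is ill-formed UTF-16 even though the 
input was well-formed.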

I also reviewed Guava's string utilities and noticed that methods such as 
commonPrefix() and commonSuffix() explicitly avoid splitting surrogate pairs, 
while not attempting full grapheme-cluster support. ICU4J, on the other hand, 
provides grapheme-aware boundary handling through Unicode text segmentation 
APIs.
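
For comparison, grapheme-level segmentation can be sketched with the JDK's own 
java.text.BreakIterator, used here as a stand-in so the example has no external 
dependency. This is not the ICU4J API, and how complex emoji sequences are 
grouped depends on the Java version. The class and method names are 
illustrative.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

final class GraphemeBoundaryDemo {

    /** Splits text at the character boundaries reported by the JDK's character BreakIterator. */
    static List<String> clusters(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // 'e' + combining acute accent, one fox emoji (a surrogate pair), then 'x'.
        String text = "e\u0301\uD83E\uDD8Ax";
        System.out.println("code units: " + text.length());
        System.out.println("clusters:   " + clusters(text).size());
    }
}
```

The input has five UTF-16 code units and four code points, but only three 
user-perceived characters, which is the distinction between surrogate-pair 
safety and full grapheme-cluster support.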

While investigating Commons Lang itself, I noticed that there are already 
methods that operate with some level of code-point awareness (for example 
capitalize() and getCommonPrefix()), whereas methods such as abbreviate(), 
left(), right(), mid(), and chop() still rely on UTF-16 code-unit indexing and 
can potentially split surrogate pairs.
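
The difference between the two indexing models can be sketched with plain JDK 
calls. leftUnits() and leftCodePoints() below are hypothetical helpers written 
for illustration; they are not the Commons Lang left() implementation.

```java
final class CodePointVsCodeUnit {

    /** Code-unit model: the first n UTF-16 code units (may end in the middle of a pair). */
    static String leftUnits(String s, int n) {
        return s.substring(0, Math.max(0, Math.min(n, s.length())));
    }

    /** Code-point model: the first n code points (never ends in the middle of a pair). */
    static String leftCodePoints(String s, int n) {
        int count = s.codePointCount(0, s.length());
        int end = s.offsetByCodePoints(0, Math.max(0, Math.min(n, count)));
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "a\uD83E\uDD8Ab"; // 'a', fox emoji (U+1F98A), 'b'
        System.out.println(s.length());                      // UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // code points
        System.out.println(leftUnits(s, 2).length());
        System.out.println(leftCodePoints(s, 2).length());
    }
}
```

With n = 2, the code-unit version returns 'a' followed by a lone high 
surrogate, while the code-point version returns 'a' followed by the complete 
emoji, which is three code units long.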

I also noted your suggestion of potentially extending the newer Strings 
abstraction with a Unicode/grapheme-aware implementation. Since that appears to 
be a larger architectural direction, I wasn't sure whether it should be 
considered as part of this issue or separately from addressing the immediate 
surrogate-pair problem.

Based on these findings, I see two possible directions for Commons Lang:
 # A minimal fix that preserves the existing API semantics and behavior, but 
validates abbreviation boundaries to ensure surrogate pairs are never split. 
This would address the creation of invalid Unicode strings while keeping the 
current UTF-16 code-unit based API contract intact.
 # A broader design change that moves StringUtils methods toward 
code-point-aware behavior. This would change how lengths, offsets, and 
truncation boundaries are interpreted and may need to be considered more 
consistently across StringUtils and potentially the wider library.
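
A rough sketch of how the two options could differ, under simplifying 
assumptions: only the two-argument form with a fixed "..." marker is covered, 
null handling and offsets are ignored, and the class and method names are 
hypothetical. This is meant to frame the discussion, not as a proposed patch.

```java
final class AbbreviateSketch {

    private static final String MARKER = "...";

    /** Option 1: widths remain UTF-16 code units; the cut moves back one unit if it would split a pair. */
    static String abbreviateUnits(String s, int maxWidth) {
        if (maxWidth < MARKER.length() + 1) {
            throw new IllegalArgumentException("Minimum abbreviation width is 4");
        }
        if (s.length() <= maxWidth) {
            return s;
        }
        int cut = maxWidth - MARKER.length();
        // cut is in [1, s.length() - 1] here, so both charAt calls are in range.
        if (Character.isHighSurrogate(s.charAt(cut - 1)) && Character.isLowSurrogate(s.charAt(cut))) {
            cut--; // do not leave a dangling high surrogate before the marker
        }
        return s.substring(0, cut) + MARKER;
    }

    /** Option 2: widths are interpreted as code points rather than code units. */
    static String abbreviateCodePoints(String s, int maxWidth) {
        if (maxWidth < MARKER.length() + 1) {
            throw new IllegalArgumentException("Minimum abbreviation width is 4");
        }
        if (s.codePointCount(0, s.length()) <= maxWidth) {
            return s;
        }
        int end = s.offsetByCodePoints(0, maxWidth - MARKER.length());
        return s.substring(0, end) + MARKER;
    }

    public static void main(String[] args) {
        String foxes = "\uD83E\uDD8A".repeat(14); // same input as the fox test in the issue
        for (int width = 4; width <= 10; width++) {
            System.out.println(width + " units:       " + abbreviateUnits(foxes, width));
            System.out.println(width + " code points: " + abbreviateCodePoints(foxes, width));
        }
    }
}
```

Under option 1 the result can be one code unit shorter than requested, and at 
width 4 the fox input collapses to just "...". Under option 2 the same call 
yields one fox followed by "...", which matches the first expected value in the 
reporter's test but changes what maxWidth means for existing callers.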

Before proceeding further, I'd like to understand which direction would be 
preferred for this issue.

Thanks.



> StringUtils.abbreviate is not emoji aware, breaks surrogate pairs
> -----------------------------------------------------------------
>
>                 Key: LANG-1770
>                 URL: https://issues.apache.org/jira/browse/LANG-1770
>             Project: Commons Lang
>          Issue Type: Bug
>          Components: lang.*
>    Affects Versions: 3.17.0
>            Reporter: Gary D. Gregory
>            Priority: Major
>
> ---------- Forwarded message ---------
> From: Carsten Kirschner <[email protected]>
> Date: Fri, Apr 11, 2025 at 10:15 AM
> Subject: [lang] StringUtils.abbreviate is not emoji aware, breaks surrogate 
> pairs
> To: [email protected] <[email protected]>
> Hello,
> The current commons lang3 StringUtils.abbreviate (3.17.0) implementation will 
> destroy 4-byte emoji characters and larger grapheme clusters. I know that 
> handling graphemes correctly before Java 20 is not possible, but at least a 
> code-point-aware solution with String.offsetByCodePoints could be built. I 
> wrote a small test to show the problem.
> The zero width joiners in the family emoji are questionable for the 
> abbreviate, but there should never be a question mark for an invalid char in 
> the result as there is now.
> The problem is not so much the „doesn’t look nice“ aspect of the broken 
> emoji, but if that abbreviated string is passed to an XML Writer 
> (com.ctc.wstx.io.UTF8Writer in my case) it throws an exception on this broken 
> byte sequence. Like this: Caused by: java.io.IOException: Broken surrogate 
> pair: first char 0xd83c, second 0x2e; illegal combination
>                 at 
> com.ctc.wstx.io.UTF8Writer._convertSurrogate(UTF8Writer.java:402) 
> ~[woodstox-core-7.0.0.jar:7.0.0]
> Thanks,
> Carsten
> {code:java}
> import org.apache.commons.lang3.StringUtils;
> import org.junit.Test;
> import static org.junit.Assert.*;
> public class AbbreviateTest {
>                 String[] expectedResultsFox = {
>                                                "🦊...", // 4
>                                                "🦊🦊...",
>                                                "🦊🦊🦊...",
>                                                "🦊🦊🦊🦊...",
>                                                "🦊🦊🦊🦊🦊...",
>                                                "🦊🦊🦊🦊🦊🦊...",
>                                                "🦊🦊🦊🦊🦊🦊🦊...", // 10
>                 };
>                 String[] expectedResultsFamilyWithCodepoints = {
>                                                "👩...",
>                                                "👩🏻...",
>                                                "👩🏻‍...", // zero width joiner
>                                                "👩🏻‍👨...",
>                                                "👩🏻‍👨🏻...",
>                                                "👩🏻‍👨🏻‍...",
>                                                "👩🏻‍👨🏻‍👦..."
>                 };
>                 String[] expectedResultsFamilyWithGrapheme = {
>                                                "👩🏻‍👨🏻‍👦🏻‍👦🏻...", // 4
>                                                "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼...",
>                                                
> "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽...",
>                                                
> "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾...",
>                                                
> "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿...",
>                                                
> "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿👩🏻‍👨🏻‍👦🏻‍👦🏻...",
>                                                
> "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼..."
>  // 10
>                 };
>                 @Test
>                 public void abbreviateTest() {
>                                String abbreviateResult;
>                                for(var i = 4; i <= 10; i++) {
>                                                abbreviateResult = 
> StringUtils.abbreviate("🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊", i);
>                                                System.out.println(abbreviateResult);
>                                                //assertEquals(expectedResultsFox[i - 4], abbreviateResult);
>                                }
>                                for(var i = 4; i <= 10; i++) {
>                                                abbreviateResult = 
> StringUtils.abbreviate("👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿",
>  i);
>                                                System.out.println(abbreviateResult);
>                                                //assertEquals(expectedResultsFamilyWithCodepoints[i - 4], abbreviateResult);
>                                }
>                 }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
