I created https://issues.apache.org/jira/browse/LANG-1770 to track this report.
Gary

On Fri, Apr 11, 2025 at 10:15 AM Carsten Kirschner
<carsten.kirsch...@corussoft.de.invalid> wrote:
>
> Hello,
>
> The current commons lang3 StringUtils.abbreviate (3.17.0) implementation will
> destroy 4 byte emoji characters and larger grapheme clusters. I know that
> handling graphemes correctly before Java 20 is not possible, but at least a
> code-point-aware solution with String.offsetByCodePoints could be built. I
> wrote a small test to show the problem.
> The zero width joiners in the family emoji are questionable for the
> abbreviation, but there should never be a question mark for an invalid char
> in the result as there is now.
>
> The problem is not so much the “doesn’t look nice” aspect of the broken
> emoji, but if that abbreviated string is passed to an XML Writer
> (com.ctc.wstx.io.UTF8Writer in my case) it throws an exception on this broken
> byte sequence. Like this:
>
> Caused by: java.io.IOException: Broken surrogate pair: first char 0xd83c, second 0x2e; illegal combination
>         at com.ctc.wstx.io.UTF8Writer._convertSurrogate(UTF8Writer.java:402) ~[woodstox-core-7.0.0.jar:7.0.0]
>
> Thanks,
> Carsten
>
>
>
> import org.apache.commons.lang3.StringUtils;
> import org.junit.Test;
> import static org.junit.Assert.*;
>
> public class AbbreviateTest {
>
>     String[] expectedResultsFox = {
>         "🦊...", // 4
>         "🦊🦊...",
>         "🦊🦊🦊...",
>         "🦊🦊🦊🦊...",
>         "🦊🦊🦊🦊🦊...",
>         "🦊🦊🦊🦊🦊🦊...",
>         "🦊🦊🦊🦊🦊🦊🦊...", // 10
>     };
>
>     String[] expectedResultsFamilyWithCodepoints = {
>         "👩...",
>         "👩🏻...",
>         "👩🏻‍...", // zero width joiner
>         "👩🏻‍👨...",
>         "👩🏻‍👨🏻...",
>         "👩🏻‍👨🏻‍...",
>         "👩🏻‍👨🏻‍👦..."
>     };
>
>     String[] expectedResultsFamilyWithGrapheme = {
>         "👩🏻‍👨🏻‍👦🏻‍👦🏻...", // 4
>         "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼...",
>         "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽...",
>         "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾...",
>         "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿...",
>         "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿👩🏻‍👨🏻‍👦🏻‍👦🏻...",
>         "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼..." // 10
>     };
>
>     @Test
>     public void abbreviateTest() {
>         String abbreviateResult;
>         for (var i = 4; i <= 10; i++) {
>             abbreviateResult =
>                 StringUtils.abbreviate("🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊", i);
>             System.out.println(abbreviateResult);
>             //assertEquals(expectedResultsFox[i - 4], abbreviateResult);
>         }
>         for (var i = 4; i <= 10; i++) {
>             abbreviateResult =
>                 StringUtils.abbreviate("👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿",
>                     i);
>             System.out.println(abbreviateResult);
>             //assertEquals(expectedResultsFamilyWithCodepoints[i - 4], abbreviateResult);
>         }
>     }
> }
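
For reference, the code-point-aware approach suggested in the report could look roughly like the sketch below. The class and the method `abbreviateByCodePoints` are hypothetical names and not part of Commons Lang; the sketch assumes that `maxWidth` counts code points rather than `char` values, which differs from what `StringUtils.abbreviate` does today. It only uses `String.codePointCount` and `String.offsetByCodePoints` from the JDK, and it does not try to keep grapheme clusters (for example ZWJ sequences) together.

```java
public class CodePointAbbreviate {

    /**
     * Hypothetical helper (not a Commons Lang API): abbreviates str to at most
     * maxWidth code points, including the three code points of the "..." marker.
     * Cuts only on code point boundaries, so a surrogate pair is never split.
     */
    public static String abbreviateByCodePoints(String str, int maxWidth) {
        final String marker = "...";
        if (str == null) {
            return null;
        }
        if (maxWidth < marker.length() + 1) {
            throw new IllegalArgumentException("Minimum abbreviation width is 4");
        }
        // Length measured in code points, not in UTF-16 chars.
        int codePoints = str.codePointCount(0, str.length());
        if (codePoints <= maxWidth) {
            return str;
        }
        // Translate "number of code points to keep" into a char index.
        int end = str.offsetByCodePoints(0, maxWidth - marker.length());
        return str.substring(0, end) + marker;
    }

    public static void main(String[] args) {
        // U+1F98A (fox face) written as its surrogate pair; 14 foxes as in the report.
        String foxes = "\uD83E\uDD8A".repeat(14);
        for (int i = 4; i <= 10; i++) {
            System.out.println(abbreviateByCodePoints(foxes, i));
        }
    }
}
```

With this counting rule the fox input yields one fox plus the marker at width 4 and seven foxes plus the marker at width 10, which matches the `expectedResultsFox` array in the report. Keeping whole grapheme clusters, as in `expectedResultsFamilyWithGrapheme`, would need a different boundary search and is not covered here.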