[PR] [StringUtils::indexOfAnyBut] redesign due to inconsistent/faulty behaviour regarding UTF-16 surrogates [commons-lang]

via GitHub Tue, 03 Dec 2024 01:42:55 -0800


IBue opened a new pull request, #1327:
URL: https://github.com/apache/commons-lang/pull/1327


   depends/stacked on #1326 
   
   Both signatures of `StringUtils::indexOfAnyBut` currently behave
   inconsistently when matching UTF-16 supplementary characters and single
   UTF-16 surrogate characters (i.e. paired and unpaired surrogates).
   They differ unnecessarily in their algorithmic implementations, each
   relies on its own incomplete and faulty interpretation of UTF-16, and
   neither takes full advantage of the standard library.
   
   The example cases below show that they may yield contradictory results
   or correct results for the wrong reasons.
   
   This proposal gives a unified algorithmic implementation of both
   signatures that
   
   - a) is much easier to grasp thanks to a clear mathematical set approach
         and safe iteration, without becoming entangled in index arithmetic;
         it also stresses the set semantics of the 2nd argument
   - b) fully relies on the standard library for well-defined UTF-16
         handling/interpretation;
         paired surrogates are merged into one code point, unpaired
         surrogates are left as they are
   - c) scales _much_ better with input sizes and result index position
   - d) can benefit from current and future improvements in the standard
         library and JVM
         (streams implementation, parallelization, JIT optimization, JEP 218, …)
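
To illustrate point b), here is a minimal sketch of how the standard library treats surrogates. It is not taken from the PR: the class `CodePointsDemo` and the helper `codePointsOf` are hypothetical names for this example, and U+1F600 (UTF-16 `D83D DE00`) was picked arbitrarily as the supplementary character.

```java
import java.util.Arrays;

public class CodePointsDemo {

    // Hypothetical helper for this sketch: the code points of a CharSequence,
    // as produced by the standard library's CharSequence.codePoints().
    static int[] codePointsOf(CharSequence cs) {
        return cs.codePoints().toArray();
    }

    public static void main(String[] args) {
        String high = "\uD83D"; // <H>
        String low = "\uDE00";  // <L>

        // (<H><L>)a : the pair is merged into one supplementary code point
        System.out.println(Arrays.toString(codePointsOf(high + low + "a")));
        // <H>a : the unpaired high surrogate is passed through unchanged
        System.out.println(Arrays.toString(codePointsOf(high + "a")));
        // <L><H> : wrong order, so no pair is formed and both stay separate
        System.out.println(Arrays.toString(codePointsOf(low + high)));
    }
}
```

With this behaviour, a lone `<H>` in one string can never match the `<H>` half of a `(<H><L>)` pair in the other, which is the interpretation the "expected/new" results in the examples rely on.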
   
   The algorithm boils down to:
   **find index i of first char in srcChars such that**
   `(srcChars.codePointAt(i) ∈ {x ∈ codepoints(srcChars) ∣ x ∉ codepoints(searchChars) })`
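
The set formulation above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code of this PR: the class name `IndexOfAnyButSketch` is hypothetical, the local `INDEX_NOT_FOUND` constant mirrors `StringUtils.INDEX_NOT_FOUND` (-1), and returning it for null or empty arguments is an assumption made for this sketch.

```java
import java.util.Set;
import java.util.stream.Collectors;

public class IndexOfAnyButSketch {

    // Mirrors StringUtils.INDEX_NOT_FOUND; redefined to keep the sketch self-contained.
    static final int INDEX_NOT_FOUND = -1;

    // Signature 1: searchChars is treated as a set of code points.
    static int indexOfAnyBut(CharSequence srcChars, CharSequence searchChars) {
        // Assumption: null or empty arguments yield INDEX_NOT_FOUND.
        if (srcChars == null || srcChars.length() == 0
                || searchChars == null || searchChars.length() == 0) {
            return INDEX_NOT_FOUND;
        }
        // Paired surrogates become one code point, unpaired ones stay as they are.
        Set<Integer> searchSet =
                searchChars.codePoints().boxed().collect(Collectors.toSet());
        int i = 0;
        while (i < srcChars.length()) {
            int codePoint = Character.codePointAt(srcChars, i);
            if (!searchSet.contains(codePoint)) {
                return i; // char index of the first code point outside the set
            }
            i += Character.charCount(codePoint); // step over 1 or 2 chars
        }
        return INDEX_NOT_FOUND;
    }

    // Signature 2: same semantics, delegating to signature 1.
    static int indexOfAnyBut(CharSequence srcChars, char... searchChars) {
        if (searchChars == null || searchChars.length == 0) {
            return INDEX_NOT_FOUND;
        }
        return indexOfAnyBut(srcChars, new String(searchChars));
    }

    public static void main(String[] args) {
        String high = "\uD83D"; // <H>
        String low = "\uDE00";  // <L>
        // srcChars = aaaa<H>, searchChars = (<H><L>)abcd
        System.out.println(indexOfAnyBut("aaaa" + high, high + low + "abcd"));
        // srcChars = (<H><L>)aaaa, searchChars = abcd<H>
        System.out.println(indexOfAnyBut(high + low + "aaaa", 'a', 'b', 'c', 'd', '\uD83D'));
    }
}
```

Because both signatures share one code path in this sketch, they cannot disagree with each other. A boxed `Set<Integer>` keeps the example short; a real implementation might prefer a primitive-friendly structure.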
   
   Examples:
   ---------
   
   - `<H>`: high-surrogate character
   - `<L>`: low-surrogate character
   - `(<H><L>)`: valid supplementary character
   - `!found`: no index found (`StringUtils.INDEX_NOT_FOUND`, i.e. -1)
   - signature 1: `StringUtils::indexOfAnyBut(final CharSequence srcChars, final CharSequence searchChars)`
   - signature 2: `StringUtils::indexOfAnyBut(final CharSequence srcChars, final char... searchChars)`
   
   **Case 1:** matching of unpaired high-surrogate
   | | srcChars | searchChars | expected/new | sig.1 | sig.2 |
   |---|---|---|---|---|---|
   | 1.1 | `<H>aaaa` | `<H>abcd` | !found | !found | !found |
   | | | | | sig.1: 'a' is _somewhere_ in `searchChars` | sig.2: 'a' happens to follow `<H>` in `searchChars` |
   | 1.2 | `<H>baaa` | `<H>abcd` | !found | !found | 0 |
   | | | | | sig.1: 'b' is _somewhere_ in `searchChars` | |
   | 1.3 | `<H>aaaa` | `(<H><L>)abcd` | 0 | !found | 0 |
   | | | | | sig.1: 'a' is _somewhere_ in `searchChars` | |
   | 1.4 | `aaaa<H>` | `(<H><L>)abcd` | 4 | !found | !found |
   | | | | | | sig.1+2 don't interpret suppl. character |
   
   **Case 2:** matching of unpaired low-surrogate
   | | srcChars | searchChars | expected/new | sig.1 | sig.2 |
   |---|---|---|---|---|---|
   | 2.1 | `<L>aaaa` | `(<H><L>)abcd` | 0 | !found | !found |
   | | | | | | sig.1+2 don't interpret suppl. character |
   | 2.2 | `aaaa<L>` | `(<H><L>)abcd` | 4 | !found | !found |
   | | | | | | sig.1+2 don't interpret suppl. character |
   
   **Case 3:** matching of supplementary character
   | | srcChars | searchChars | expected/new | sig.1 | sig.2 |
   |---|---|---|---|---|---|
   | 3.1 | `(<H><L>)aaaa` | `<L>ab<H>cd` | 0 | !found | 0 |
   | | | | | sig.1: `<L>` is _somewhere_ in `searchChars` | |
   | 3.2 | `(<H><L>)aaaa` | `abcd` | 0 | 1 | 0 |
   | | | | | sig.1 always points to low-surrogate of (fully) unmatched suppl. character | |
   | 3.3 | `(<H><L>)aaaa` | `abcd<H>` | 0 | 0 | 1 |
   | 3.4 | `(<H><L>)aaaa` | `abcd<L>` | 0 | !found | 0 |
   | | | | | sig.1: `<H>` skipped by algorithm | |
   

