IBue opened a new pull request, #1327:
URL: https://github.com/apache/commons-lang/pull/1327
depends/stacked on #1326
The two signatures of `StringUtils::indexOfAnyBut` currently behave
inconsistently when matching UTF-16 supplementary characters and single
UTF-16 surrogate characters (i.e. paired and unpaired surrogates). They
use needlessly different algorithms, each applies its own incomplete
and faulty interpretation of UTF-16, and neither takes full advantage
of the standard library.
The example cases below show that they may yield contradictory results
or correct results for the wrong reasons.
This proposal gives a unified algorithmic implementation of both
signatures that
- a) is much easier to grasp, because it uses a clear mathematical set
approach and safe iteration instead of index arithmetic, and stresses
the set semantics of the 2nd argument;
- b) relies fully on the standard library for well-defined UTF-16
handling/interpretation: paired surrogates are merged into one code
point, unpaired surrogates are left as they are;
- c) scales _much_ better with input size and result index position;
- d) can benefit from current and future improvements in the standard
library and JVM
(streams implementation, parallelization, JIT optimization, JEP 218,
???…)
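To illustrate point b), here is a small sketch of how the standard library's `CharSequence.codePoints()` treats surrogates. The class and method names (`CodePointsDemo`, `codePointsOf`) are hypothetical and only serve this example; U+1F600 is an arbitrarily chosen supplementary character.

```java
import java.util.Arrays;

final class CodePointsDemo {
    // Returns the code points of cs as produced by the standard library:
    // a valid surrogate pair becomes one code point, a lone surrogate
    // is passed through unchanged.
    static int[] codePointsOf(CharSequence cs) {
        return cs.codePoints().toArray();
    }

    public static void main(String[] args) {
        String high = "\uD83D"; // <H>
        String low = "\uDE00";  // <L>

        // (<H><L>) followed by 'a': the pair is merged into U+1F600
        System.out.println(Arrays.toString(codePointsOf(high + low + "a"))); // [128512, 97]
        // unpaired <H> followed by 'a': the surrogate stays as it is
        System.out.println(Arrays.toString(codePointsOf(high + "a")));       // [55357, 97]
        // unpaired <L> followed by 'a'
        System.out.println(Arrays.toString(codePointsOf(low + "a")));        // [56832, 97]
    }
}
```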
The algorithm boils down to:
**find index i of first char in srcChars such that**
`(srcChars.codePointAt(i) ∈ {x ∈ codepoints(srcChars) ∣ x ∉
codepoints(searchChars) })`
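The set formulation above could be sketched roughly as follows. This is not the code from the pull request (which mentions streams); it is a minimal loop-based illustration. The class name `IndexOfAnyButSketch` is hypothetical, and returning -1 for null or empty arguments is an assumption meant to mirror the existing method's contract.

```java
import java.util.Set;
import java.util.stream.Collectors;

final class IndexOfAnyButSketch {
    static final int INDEX_NOT_FOUND = -1;

    static int indexOfAnyBut(CharSequence srcChars, CharSequence searchChars) {
        // Assumption: null or empty arguments yield "not found".
        if (srcChars == null || srcChars.length() == 0
                || searchChars == null || searchChars.length() == 0) {
            return INDEX_NOT_FOUND;
        }
        // codepoints(searchChars) as a set; surrogate pairs are merged,
        // lone surrogates are kept as individual code points.
        Set<Integer> searchSet = searchChars.codePoints()
                .boxed()
                .collect(Collectors.toSet());
        // Walk srcChars code point by code point and return the char index
        // of the first code point that is not in the set.
        for (int i = 0; i < srcChars.length(); ) {
            int cp = Character.codePointAt(srcChars, i);
            if (!searchSet.contains(cp)) {
                return i;
            }
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return INDEX_NOT_FOUND;
    }

    public static void main(String[] args) {
        String h = "\uD83D"; // <H>
        String l = "\uDE00"; // <L>
        System.out.println(indexOfAnyBut(h + "aaaa", h + "abcd"));     // -1
        System.out.println(indexOfAnyBut("aaaa" + h, h + l + "abcd")); // 4
        System.out.println(indexOfAnyBut(h + l + "aaaa", "abcd" + l)); // 0
    }
}
```

Because the index advances by `Character.charCount(cp)`, the returned value always points at the first char of a code point, never at the low surrogate of a valid pair.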
Examples:
---------
`<H>`: high-surrogate character
`<L>`: low-surrogate character
`(<H><L>)`: valid supplementary character
signature 1: `StringUtils::indexOfAnyBut(final CharSequence srcChars, final
CharSequence searchChars)`
signature 2: `StringUtils::indexOfAnyBut(final CharSequence srcChars, final
char... searchChars)`
**Case 1:** matching of unpaired high-surrogate
| # | srcChars | searchChars | expected/new | sig.1 | sig.2 |
|---|---|---|---|---|---|
| 1.1 | `<H>aaaa` | `<H>abcd` | !found | !found | !found |
| | | | | sig.1: 'a' is _somewhere_ in `searchChars` | sig.2: 'a' happens to follow `<H>` in `searchChars` |
| 1.2 | `<H>baaa` | `<H>abcd` | !found | !found | 0 |
| | | | | sig.1: 'b' is _somewhere_ in `searchChars` | |
| 1.3 | `<H>aaaa` | `(<H><L>)abcd` | 0 | !found | 0 |
| | | | | sig.1: 'a' is _somewhere_ in `searchChars` | |
| 1.4 | `aaaa<H>` | `(<H><L>)abcd` | 4 | !found | !found |
| | | | | | sig.1+2 don't interpret suppl. character |
**Case 2:** matching of unpaired low-surrogate
| # | srcChars | searchChars | expected/new | sig.1 | sig.2 |
|---|---|---|---|---|---|
| 2.1 | `<L>aaaa` | `(<H><L>)abcd` | 0 | !found | !found |
| | | | | | sig.1+2 don't interpret suppl. character |
| 2.2 | `aaaa<L>` | `(<H><L>)abcd` | 4 | !found | !found |
| | | | | | sig.1+2 don't interpret suppl. character |
**Case 3:** matching of supplementary character
| # | srcChars | searchChars | expected/new | sig.1 | sig.2 |
|---|---|---|---|---|---|
| 3.1 | `(<H><L>)aaaa` | `<L>ab<H>cd` | 0 | !found | 0 |
| | | | | sig.1: `<L>` is _somewhere_ in `searchChars` | |
| 3.2 | `(<H><L>)aaaa` | `abcd` | 0 | 1 | 0 |
| | | | | sig.1 always points to low-surrogate of (fully) unmatched suppl. character | |
| 3.3 | `(<H><L>)aaaa` | `abcd<H>` | 0 | 0 | 1 |
| 3.4 | `(<H><L>)aaaa` | `abcd<L>` | 0 | !found | 0 |
| | | | | sig.1: `<H>` skipped by algorithm | |