On Wed, 16 Feb 2022 21:00:00 GMT, Naoto Sato <[email protected]> wrote:
>> This fixes a bug in how CIBackRef traverses Unicode characters, which can be
>> variable-length. It originally followed the approach that BackRef takes,
>> but failed to account for Unicode characters that occupy two chars (a
>> surrogate pair). The upper bound (groupSize) of the traversal loop is the
>> difference between the group's start and end indexes, counted in chars.
>> That works for single-char characters, and it also works for case-sensitive
>> comparisons because char-by-char comparison is acceptable there, but it
>> does not work when some kind of normalization (e.g. case folding) is
>> required. This fix lowers the loop's upper bound whenever a two-char
>> character is encountered.
>>
>> An alternative was to compute the group's length in code points by scanning
>> the group in advance, but that could result in multiple scans and code
>> point conversions of the same matcher group, which could be long. Adjusting
>> the loop bound on the fly avoids that extra work.
>
> src/java.base/share/classes/java/util/regex/Pattern.java line 5104:
>
>> 5102: j += Character.charCount(c2);
>> 5103:
>> 5104: if(xIncr > 1) {
>
> You can eliminate `xIncr` by comparing `c1 >=
> Character.MIN_SUPPLEMENTARY_CODE_POINT` here.

Nice! Thanks, will do.
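
For readers following along, here is a minimal standalone sketch of the loop with that suggestion applied. It is not the actual `Pattern.java` code: the class name `CaseInsensitiveRegionSketch` and the helper `regionMatchesIgnoreCase` are hypothetical, only the Unicode-case comparison is shown, and it assumes that a supplementary code point only ever case-matches another supplementary code point.

```java
class CaseInsensitiveRegionSketch {

    // Hypothetical helper: compares the chars at [groupStart, groupEnd) with
    // the text starting at i, ignoring case, one code point at a time.
    static boolean regionMatchesIgnoreCase(CharSequence seq, int i,
                                           int groupStart, int groupEnd) {
        int groupSize = groupEnd - groupStart;   // measured in chars
        if (groupStart < 0 || i + groupSize > seq.length()) {
            return false;                        // no group, or not enough input
        }
        int x = i;
        int j = groupStart;
        for (int index = 0; index < groupSize; index++) {
            int c1 = Character.codePointAt(seq, x);
            int c2 = Character.codePointAt(seq, j);
            if (c1 != c2) {
                int u1 = Character.toUpperCase(c1);
                int u2 = Character.toUpperCase(c2);
                if (u1 != u2
                        && Character.toLowerCase(u1) != Character.toLowerCase(u2)) {
                    return false;
                }
            }
            x += Character.charCount(c1);
            j += Character.charCount(c2);
            // A supplementary code point consumed two chars but only one
            // iteration, so shrink the bound by one (no xIncr needed).
            if (c1 >= Character.MIN_SUPPLEMENTARY_CODE_POINT) {
                groupSize--;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // U+10400 followed by U+10428 (Deseret capital and small letter long I)
        String s = "\uD801\uDC00\uD801\uDC28";
        System.out.println(regionMatchesIgnoreCase(s, 2, 0, 2));
    }
}
```

Without the bound adjustment, the second iteration in that example would read past the end of the input, which is the failure mode described in the PR summary.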
-------------
PR: https://git.openjdk.java.net/jdk/pull/7501