Hi,

I wanted to get input on a potential bug in the regex \b handling of
non-spacing marks without UNICODE_CHARACTER_CLASS:
https://bugs.openjdk.org/browse/JDK-8384137.

I'd appreciate any input on whether the analysis here is correct, and if so
whether it would be better to change the implementation or the Pattern
specification, and also whether any of this is worth changing given the
potential compatibility impact. I think this is an edge case that may not
affect much real-world code. I plan to do more corpus analysis and can
report back.

Consider a string like "\u8FE3\u030c" and the regex \b. U+8FE3 'CJK Unified
Ideograph-8FE3' matches \w if and only if UNICODE_CHARACTER_CLASS is set.
U+030C 'Combining Caron' is a non-spacing mark.

With UNICODE_CHARACTER_CLASS set, the Pattern javadoc specifies that \w
matches characters including \p{gc=Mn}, which matches U+030C. So \b matches
the input string at offsets 0 and 2, i.e. the entire string is a word with
boundaries at the beginning and end.
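
A minimal sketch for reproducing this. The class name BoundaryOffsets and
its find helper are hypothetical (not part of the JDK); the helper just
collects every offset at which \b matches under the given flags:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the JDK.
class BoundaryOffsets {
    // Returns every offset in the input at which \b matches under the given flags.
    static List<Integer> find(String input, int flags) {
        Matcher m = Pattern.compile("\\b", flags).matcher(input);
        List<Integer> offsets = new ArrayList<>();
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        String s = "\u8FE3\u030c";
        // With UNICODE_CHARACTER_CLASS, both characters match \w.
        System.out.println(find(s, Pattern.UNICODE_CHARACTER_CLASS)); // [0, 2]
        // Without the flag the result may differ between JDK versions,
        // so no expected output is given here.
        System.out.println(find(s, 0));
    }
}
```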

When UNICODE_CHARACTER_CLASS is _not_ set, the implementation of \b has
logic to count "non spacing marks as word characters in bounds calculations
if they have a base character". (This isn't mentioned in the specification
for Pattern; see JDK-6452709.)

https://github.com/openjdk/jdk/blob/253df3a580b37ee277cb6a6ccd604ebaf28d4468/src/java.base/share/classes/java/util/regex/Pattern.java#L5576-L5577

That logic uses Character.isLetterOrDigit to check if a character is a base
character. So \b treats U+8FE3 by itself as a non-word character, because
it is not an ASCII word character, but then treats U+030C as a word
character, because isLetterOrDigit accepts the non-ASCII base character.
As a result it reports a word boundary between the base character and the
non-spacing mark. That seems like a bug:
https://www.unicode.org/reports/tr18/#RL1.4 says

> Nonspacing marks are never divided from their base characters, and
> otherwise ignored in locating boundaries.
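
The mismatch between the two classifications can be checked directly. A
minimal sketch, assuming a hypothetical class name BaseCharacterCheck; it
only uses public Character and Pattern APIs and does not reproduce the
internal \b logic:

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration; not part of the JDK.
class BaseCharacterCheck {
    static final int IDEOGRAPH = 0x8FE3; // CJK Unified Ideograph-8FE3
    static final int CARON = 0x030C;     // Combining Caron

    // True if the single code point matches \w under the given flags.
    static boolean matchesWord(int codePoint, int flags) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.compile("\\w", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        // The base-character check accepts U+8FE3 ...
        System.out.println(Character.isLetterOrDigit(IDEOGRAPH)); // true
        // ... but \w without UNICODE_CHARACTER_CLASS does not.
        System.out.println(matchesWord(IDEOGRAPH, 0)); // false
        System.out.println(matchesWord(IDEOGRAPH, Pattern.UNICODE_CHARACTER_CLASS)); // true
        // U+030C is a non-spacing mark (general category Mn).
        System.out.println(Character.getType(CARON) == Character.NON_SPACING_MARK); // true
    }
}
```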

I think this is partially a regression after JDK-8264160, which updated \b
to only match ASCII word characters, but didn't update the base character
logic, which is still using isLetterOrDigit.

I think one potential path here would be to (1) update the Pattern
specification to mention that \b never divides non-spacing marks from their
base characters, i.e. without UNICODE_CHARACTER_CLASS it will treat
trailing NSM characters as part of the same word as ASCII base characters,
and (2) fix the NSM logic to check for ASCII base characters instead of
using Character.isLetterOrDigit for consistency with the changes in
JDK-8264160.
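
As a rough illustration of (2), the base-character check could test for
ASCII word characters instead of calling Character.isLetterOrDigit. This is
a standalone sketch, not the JDK's code: the class AsciiBaseSketch and its
methods are hypothetical names, and it ignores matcher region bounds.

```java
// Hypothetical sketch of an ASCII-only base-character check; not JDK code.
class AsciiBaseSketch {
    // ASCII word character as in the default \w: [a-zA-Z_0-9].
    static boolean isAsciiWord(int ch) {
        return ch == '_'
                || (ch >= '0' && ch <= '9')
                || (ch >= 'a' && ch <= 'z')
                || (ch >= 'A' && ch <= 'Z');
    }

    // Walks backwards over the code points ending just before offset 'end',
    // skipping non-spacing marks, and reports whether the first other
    // character found is an ASCII word character.
    static boolean hasAsciiBase(CharSequence seq, int end) {
        int i = end;
        while (i > 0) {
            int ch = Character.codePointBefore(seq, i);
            if (isAsciiWord(ch)) {
                return true;
            }
            if (Character.getType(ch) != Character.NON_SPACING_MARK) {
                return false;
            }
            i -= Character.charCount(ch);
        }
        return false; // only marks, no base character
    }

    public static void main(String[] args) {
        System.out.println(hasAsciiBase("a\u030c", 2));      // true
        System.out.println(hasAsciiBase("\u8FE3\u030c", 2)); // false
    }
}
```

With a check like this, the mark in "\u8FE3\u030c" would no longer be
counted as a word character, so it would not be split from its base.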

(Additionally, there's a potential bug in the handling of surrogate pairs
in the NSM logic, which is how I started looking at this area: JDK-8384082)

Thanks,
Liam
