Hi,
While still waiting patiently for the review for
RFR: Regex exponential backtracking issue [1]
RFR: Regex exponential backtracking issue --- more cleanup/tuning [2]
Here is the third round of change to address the "broken canonical
equivalent support"
issue listed in JEP-111 [3] .
Canonical Equivalents:
Unicode Standard defines "canonical equivalents" in its tr18# [4]. The
"canonical equivalents"
related item has been updated lot of times. It appears the standard
itself has kinda given up
on defining a "general approach" for the regex engine :-) with the
wordings "In practice, regex
APIs are not setup to match parts of .... it is feasible, however, to
construct patterns that will
match against NFD ..." The "canonical equivalents" is just too
complicated and too many
edge cases for a "normal" regex engine and its APIs.
We probably should have optioned to not support it if had the choice
again. Unfortunately
Java regex has spec-ed to support such "canonical equivalents" from the
very beginning
and has been "supporting" this feature with something that least works
reasonably in some
common cases for releases.
The current regex CANON_EQ support is kinda of broken, especially the
implementation for
the character class construct. It simply does not work as expected, as
reported in various
issues listed below.
https://bugs.openjdk.java.net/browse/JDK-4916384
https://bugs.openjdk.java.net/browse/JDK-4867170
https://bugs.openjdk.java.net/browse/JDK-6995635
https://bugs.openjdk.java.net/browse/JDK-6728861
https://bugs.openjdk.java.net/browse/JDK-6736245
https://bugs.openjdk.java.net/browse/JDK-7080302
How does the java regex "work" now when the CANON_EQ flag is set? the
current
implementation first
(1) converts the whole pattern into Normalizer.Form.NFD form, for
example for
the greek extended character "\u1f80", we convert it to its nfd form as
\u1f80 -> \u03b1\u0313\u0345
which has a base character \u03b1 (small alpha) followed by two
non-spacing_mark
characters \u0313 (combining comma above) and \u0345 (greek iota below)
(2) then we generate all the possible "permutations" of the characters
inside the nfd
string (based on the unicode nfd/nfc normalization rules, the base
character stays
where it is, those non-space-mark characters can be in any order for NOT
normalized
text), which includes the possible new "combination" of individual each
characters.
\u3b1\u313\u345
\u3b1\u345\u313
\u1f00\u345 (new combination \u3b1\u0313 -> \u1f00)
\u1fb3\u313 (new combination \u3b1\u0345 -> \u1fb3)
\u1f80
(3) finally a pure group is constructed with the permutations, which
will match
any canonical equivalences, to replace the original single \u1f80
(?:\u1f80|\u3b1\u313\u345|\u1f00\u345|\u3b1\u345\u313|\u1fb3\u313)
The resulting pure group, though looks tedious, can match all the canonical
equivalences (are literally listed as the "alternation" inside the pure
group construct)
of the greek character \u1f80, especially in "literal" case (slice of
characters). For
example, pattern "A\u1f80B" matches successfully for input like
"A\u3b1\u313\u345B"
"A\u3b1\u345\u313B"
"A\u1f00\u345B"
"A\u1fb3\u313B"
"A\u1f80B"
And it works fine even you put it inside a character class construct
[...] The pattern
"[\u1f80]" can successfully find its corresponding canonical
equivalences from the
above input strings.
But it starts to fail apart when you try a little more "complicated"
character class, for
example, the negation [^\u1f80A] does not match A but matches \u1f80,
and match
all of its canonical equivalences.
Range [\u1f80-\u1f82] matches \u1f80, \u1f82 and their canonical
equivalences, as
expected but it doesn't match \u1f81 (and its canonical equivalences),
as people would
normally expected (as \u1f81 is perfectly inside that range) .
The reason behind these failures is because the current implementation
converts/
normalizes the "character class" pattern the same way as it does for the
"slice of
characters" pattern. Take a look at the "normalized" pattern from the
character class
samples,
[^\u1f80A]
-->
(?:[^A]|\u3b1\u313\u345|\u1f00\u345|\u1f80|\u3b1\u345\u313|\u1fb3\u313|\u1f80)
[\u1f80-\u1f82]
-->
(?:[-]|\u3b1\u313\u345|\u1f00\u345|\u1f80|\u3b1\u345\u313|\u1fb3\u313|\u1f80|\u3b1\u313\u300\u345|...)
The current implementation simply takes those "composed" characters out
of the
character class body, converts them into alternations (as it does for
the slice)
and append them at the end of the character class construct. It just
does not work,
the resulting pattern does not really match what the original regex
pattern means
to be.
[^\u1f80A] -> any character neither is 'A' nor \u1f8a", including their
canonical
equivalences.
[[\u1f80-\u1f82] ]
-> any character (including their canonical
equivalences) with the range
of \u1f80 andu1f82
The proposed changes here are
(1) instead of normalizing everything of the pattern into nfd,
normalizing the character
class part into nfc, as the "character class" really needs to match a
"character" or the
"canonical equivalence" of this "character (composed). It can be
interpreted as to match
a "grapheme" that can be normalized into a "nfc" that matches this
"character". For
example if you have a luster of \u03b1\u0314\u0345" inside a character
class, it is
reasonable to assume you really mean to match character \u1f80 and/its
canonical
equivalences, when the CANON_EQ when the flag is on.
(2) instead of trying to generate the permutations, create the
alternation and then put
it appropriately into the character class (logically), we now use a
special "Node", the
NFCCharProperty to do the matching work. The NFCCharProperty tries to
match a
grapheme cluster at a time (nfc greedly, then backtrack) against the
character class.
For example for character class [\u1f80] or [\u03b1\u0313\u0345] the
resulting
"normalized "pattern is [\u1f80]
and for the negation the "normalized" pattern is a normal [^\u1f80] for
both [^\u1f80]
and [^\u03b1\u0313\u0345]
When matching, we try to match the input against the targeting
character class
grapheme by grapheme. The engine picks a grapheme cluster first,
normalize it
to its NFC form, and then match it against the character class.
For example,
input text: "ab\u1f81cd"
"ab\u03b1\u0314\u0345cd"
"ab\u03b1\u0345\u0314cd"
"ab\u1f01\u0345cd"
"ab\u1fb3\u0314cd"
character class: [\u1f80-\u1f82],
The engine finds the grapheme cluster "\u03b1\u0314\u0345" (or any of
the other combination)
first, normalize it into \u1f81, match it against the range
\u1f80-\u1f82, and move on.
In case there is(are) extra non-spacing-mark character in the grapheme
cluster, for example
"ab\u03b1\u0314\u0345\u0313cd" (there is any extra \u0313")
The "find" will still have a "find" at \u03b1\u0314\u0345. But the input
will not "match"
"ab[\u1f80-\u1f82]cd" (because there is extra), but it will "match"
"ab[\u1f80-\u1f82]\u0313cd".
(as an implementation detail, the engine will first try
"\u03b1\u0314\u0345\u0313",
which ends up with a nfc string"\u1f81\u0313", can NOT match the single
character, it backtracks
to "\u03b1\u0314\u0345" ...)
The approach seems to work and fix most of the troubles we have in
character class.
Here is the webrev (on top of the changes for JDK-8151481 [1])
http://cr.openjdk.java.net/~sherman/regexCE/webrev/
While the character class CANON_EQ support is the main issue we want to
address
this time, there are also couple other problems reported in the past,
they are fixed
together here.
(1) the current implement only supports base+non_spacing_marks CANON_EQ,
the
the canonical equivalence of decomposed hangul (jamos) and composed hangl
syllables is NOT supported, for example "\u1100\u1161" vs "\uac00".
fixed.
(2) Character in Unicode composition exclusion table does not match
itself, as
reported in JDK-6736245. (special composition sample,
nfd(\u2adc) -> \u2add\u0338
nfc(\u2add\u0338) -> \u2add\u0338 (NOT back to \u2adc)
fixed.
(3) regex compiling syntax error/exception when compile certain regex,
for example
"(\u00e9)" triggers
Exception in thread "main" java.util.regex.PatternSyntaxException:
Unmatched closing ')' near index 11
((?:eÌ)|é)|e)Ì)
^
fixed.
(4) the canonical equivalence does not work for the property class
"\\p{IsGreek}" matches "\u1f80"
"\\p{IsGreek}" but does match "\u1f00\u0345\u0300"
works as expected now.
thanks,
Sherman
[1]
http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039269.html
[2]
http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039404.html
[3] http://openjdk.java.net/jeps/111
[4] http://unicode.org/reports/tr18/#Canonical_Equivalents