RFR: Regex canonical equivalents

Xueming Shen Sat, 19 Mar 2016 09:40:18 -0700

Hi,

While still waiting patiently for the review for


RFR: Regex exponential backtracking issue [1]
RFR: Regex exponential backtracking issue --- more cleanup/tuning [2]

Here is the third round of changes, addressing the "broken canonical equivalent
support" issue listed in JEP-111 [3].

Canonical Equivalents:

The Unicode Standard defines "canonical equivalents" in its TR18 [4]. The
"canonical equivalents" item has been updated many times. It appears the
standard itself has kinda given up on defining a "general approach" for the
regex engine :-) with the wording "In practice, regex APIs are not set up to
match parts of .... it is feasible, however, to construct patterns that will
match against NFD ..." Canonical equivalence is just too complicated, with too
many edge cases, for a "normal" regex engine and its APIs.

We probably would have opted not to support it if we had the choice again.
Unfortunately Java regex has been spec-ed to support such "canonical equivalents"
from the very beginning, and has been "supporting" this feature with something
that at least works reasonably in some common cases across releases.

The current regex CANON_EQ support is kind of broken, especially the
implementation for the character class construct. It simply does not work as
expected, as reported in the various issues listed below.

https://bugs.openjdk.java.net/browse/JDK-4916384
https://bugs.openjdk.java.net/browse/JDK-4867170
https://bugs.openjdk.java.net/browse/JDK-6995635
https://bugs.openjdk.java.net/browse/JDK-6728861
https://bugs.openjdk.java.net/browse/JDK-6736245
https://bugs.openjdk.java.net/browse/JDK-7080302

How does the Java regex "work" now when the CANON_EQ flag is set? The current
implementation first

(1) converts the whole pattern into Normalizer.Form.NFD form. For example, the
Greek extended character "\u1f80" is converted to its NFD form as

\u1f80 -> \u03b1\u0313\u0345

which has a base character \u03b1 (small alpha) followed by two non-spacing-mark
characters, \u0313 (combining comma above) and \u0345 (Greek iota below)

(2) then we generate all the possible "permutations" of the characters inside the
NFD string (based on the Unicode NFD/NFC normalization rules, the base character
stays where it is, while the non-spacing-mark characters can be in any order in
text that is NOT normalized), including the possible new "combinations" of the
individual characters.


\u3b1\u313\u345
\u3b1\u345\u313
\u1f00\u345 (new combination \u3b1\u0313 -> \u1f00)
\u1fb3\u313 (new combination \u3b1\u0345 -> \u1fb3)
\u1f80

(3) finally a pure group is constructed from the permutations, which will match
any of the canonical equivalents, to replace the original single \u1f80

(?:\u1f80|\u3b1\u313\u345|\u1f00\u345|\u3b1\u345\u313|\u1fb3\u313)
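As a quick sanity check on steps (1) to (3), here is a minimal sketch that uses java.text.Normalizer to confirm that the five alternatives in the group above are canonically equivalent, i.e. they all share the same NFD form. The class and helper names are made up for illustration; this is not code from the regex implementation.

```java
import java.text.Normalizer;

class CanonicalEquivalentsCheck {
    // The five alternatives from the pure group above, written with 4-digit escapes.
    static final String[] ALTERNATIVES = {
        "\u1f80",
        "\u03b1\u0313\u0345",
        "\u03b1\u0345\u0313",
        "\u1f00\u0345",
        "\u1fb3\u0313",
    };

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Two strings are canonically equivalent when their NFD forms are identical.
    static boolean canonicallyEquivalent(String a, String b) {
        return nfd(a).equals(nfd(b));
    }

    public static void main(String[] args) {
        for (String alt : ALTERNATIVES) {
            System.out.println(alt.length() + " char(s), equivalent to U+1F80: "
                    + canonicallyEquivalent(alt, "\u1f80"));
        }
    }
}
```

Comparing NFD forms is the simplest way to test equivalence here; the regex engine itself cannot do this on the whole input, which is why it enumerates the alternatives instead.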

The resulting pure group, though it looks tedious, can match all the canonical
equivalents of the Greek character \u1f80 (they are literally listed as the
"alternation" inside the pure group construct), especially in the "literal" case
(a slice of characters). For example, the pattern "A\u1f80B" matches successfully
for input like

"A\u3b1\u313\u345B"
"A\u3b1\u345\u313B"
"A\u1f00\u345B"
"A\u1fb3\u313B"
"A\u1f80B"

And it works fine even if you put it inside a character class construct [...].
The pattern "[\u1f80]" can successfully find its corresponding canonical
equivalents in the above input strings.
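The literal case described above can be tried with a small program like the following. It is only a sketch: the class name and the escape() helper are mine, and the exact results may vary with the JDK you run it on, so it prints what it observes rather than asserting anything.

```java
import java.util.regex.Pattern;

class CanonEqLiteralDemo {
    // The five input strings listed above, written with 4-digit escapes.
    static final String[] INPUTS = {
        "A\u03b1\u0313\u0345B",
        "A\u03b1\u0345\u0313B",
        "A\u1f00\u0345B",
        "A\u1fb3\u0313B",
        "A\u1f80B",
    };

    // Whole-input match with CANON_EQ enabled.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex, Pattern.CANON_EQ).matcher(input).matches();
    }

    // Substring search with CANON_EQ enabled.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex, Pattern.CANON_EQ).matcher(input).find();
    }

    // Renders non-ASCII code points as escapes so the output is readable.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(
                cp < 0x80 ? String.valueOf((char) cp) : String.format("\\u%04x", cp)));
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String in : INPUTS) {
            System.out.println(escape(in)
                    + "  literal matches=" + matches("A\u1f80B", in)
                    + "  class find=" + finds("[\u1f80]", in));
        }
    }
}
```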

But it starts to fall apart when you try a slightly more "complicated" character
class. For example, the negation [^\u1f80A] does not match A, but it does match
\u1f80 and all of its canonical equivalents.

The range [\u1f80-\u1f82] matches \u1f80, \u1f82 and their canonical equivalents,
as expected, but it doesn't match \u1f81 (or its canonical equivalents), as people
would normally expect (\u1f81 is clearly inside that range).
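To see which behavior a given JDK exhibits for the negation and range cases above, a probe along these lines can be used. This is a hypothetical helper, not part of any test suite; it deliberately makes no claim about the output, since that depends on whether the runtime has the old or the new CANON_EQ implementation.

```java
import java.util.regex.Pattern;

class CanonEqClassProbe {
    // Whole-input match with the given flags (0 or Pattern.CANON_EQ).
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String decomposed1f80 = "\u03b1\u0313\u0345"; // NFD of U+1F80
        String decomposed1f81 = "\u03b1\u0314\u0345"; // NFD of U+1F81
        int ce = Pattern.CANON_EQ;

        // Negated class: ideally none of these should match.
        System.out.println("negation vs A:               " + matches("[^\u1f80A]", ce, "A"));
        System.out.println("negation vs U+1F80:          " + matches("[^\u1f80A]", ce, "\u1f80"));
        System.out.println("negation vs decomposed 1F80: " + matches("[^\u1f80A]", ce, decomposed1f80));

        // Range: ideally both of these should match.
        System.out.println("range vs U+1F81:             " + matches("[\u1f80-\u1f82]", ce, "\u1f81"));
        System.out.println("range vs decomposed 1F81:    " + matches("[\u1f80-\u1f82]", ce, decomposed1f81));
    }
}
```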

These failures occur because the current implementation converts/normalizes the
"character class" pattern the same way as it does the "slice of characters"
pattern. Take a look at the "normalized" patterns for the character class
samples:

[^\u1f80A]

-->(?:[^A]|\u3b1\u313\u345|\u1f00\u345|\u1f80|\u3b1\u345\u313|\u1fb3\u313|\u1f80)


[\u1f80-\u1f82]

-->(?:[-]|\u3b1\u313\u345|\u1f00\u345|\u1f80|\u3b1\u345\u313|\u1fb3\u313|\u1f80|\u3b1\u313\u300\u345|...)

The current implementation simply takes those "composed" characters out of the
character class body, converts them into alternations (as it does for the slice)
and appends them at the end of the character class construct. It just does not
work; the resulting pattern does not really match what the original regex pattern
is meant to be:

[^\u1f80A]
-> any character that is neither 'A' nor \u1f80, including their canonical
   equivalents

[\u1f80-\u1f82]
-> any character (including its canonical equivalents) within the range
   of \u1f80 to \u1f82

The proposed changes here are

(1) instead of normalizing everything in the pattern into NFD, normalize the
character class part into NFC, as the "character class" really needs to match a
"character" or the "canonical equivalents" of this (composed) "character". It can
be interpreted as matching a "grapheme" that can be normalized into an NFC form
that matches this "character". For example, if you have a cluster of
"\u03b1\u0313\u0345" inside a character class, it is reasonable to assume you
really mean to match the character \u1f80 and/or its canonical equivalents when
the CANON_EQ flag is on.

(2) instead of trying to generate the permutations, create the alternation and
then put it appropriately into the character class (logically), we now use a
special "Node", the NFCCharProperty, to do the matching work. The NFCCharProperty
tries to match one grapheme cluster at a time (NFC greedily, then backtrack)
against the character class.

For example, for the character class [\u1f80] or [\u03b1\u0313\u0345] the
resulting "normalized" pattern is [\u1f80],

and for the negation the "normalized" pattern is a normal [^\u1f80] for both
[^\u1f80] and [^\u03b1\u0313\u0345].

When matching, we try to match the input against the target character class
grapheme by grapheme. The engine picks a grapheme cluster first, normalizes it
to its NFC form, and then matches it against the character class.

For example,

input text:  "ab\u1f81cd"
                  "ab\u03b1\u0314\u0345cd"
                  "ab\u03b1\u0345\u0314cd"
                  "ab\u1f01\u0345cd"
                  "ab\u1fb3\u0314cd"

character class: [\u1f80-\u1f82],

The engine first finds the grapheme cluster "\u03b1\u0314\u0345" (or any of the
other combinations), normalizes it into \u1f81, matches it against the range
\u1f80-\u1f82, and moves on.

In case there are extra non-spacing-mark characters in the grapheme cluster, for
example

"ab\u03b1\u0314\u0345\u0313cd"    (there is an extra "\u0313")

The "find" will still have a "find" at \u03b1\u0314\u0345. But the input will not
"match" "ab[\u1f80-\u1f82]cd" (because of the extra mark), though it will "match"
"ab[\u1f80-\u1f82]\u0313cd".

(As an implementation detail, the engine will first try "\u03b1\u0314\u0345\u0313",
which ends up as the NFC string "\u1f81\u0313" and can NOT match the single
character, so it backtracks to "\u03b1\u0314\u0345" ...)
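The matching strategy described above (take the longest cluster, normalize to NFC, test against the class, backtrack one mark at a time) can be sketched roughly as follows. This is NOT the NFCCharProperty code from the webrev; the class, the method names and the simplified "base character plus following marks" cluster rule are all assumptions made for illustration, and Hangul jamo sequences are not handled.

```java
import java.text.Normalizer;
import java.util.function.IntPredicate;

class NfcClassSketch {
    // Stand-in for the character class [U+1F80 - U+1F82].
    static final IntPredicate RANGE = cp -> cp >= 0x1f80 && cp <= 0x1f82;

    // Simplified cluster rule: any mark extends the cluster started by the base.
    static boolean isMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Returns the end index of the text matched at 'from', or -1 if nothing matches.
    static int matchClass(String input, int from, IntPredicate charClass) {
        if (from >= input.length()) {
            return -1;
        }
        // Greedy step: base character plus all following marks.
        int end = from + Character.charCount(input.codePointAt(from));
        while (end < input.length() && isMark(input.codePointAt(end))) {
            end += Character.charCount(input.codePointAt(end));
        }
        // Try the longest candidate first, then drop one trailing code point at a time.
        while (end > from) {
            String nfc = Normalizer.normalize(input.substring(from, end), Normalizer.Form.NFC);
            if (nfc.codePointCount(0, nfc.length()) == 1 && charClass.test(nfc.codePointAt(0))) {
                return end;
            }
            end -= Character.charCount(input.codePointBefore(end));
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "ab\u1f81cd",
            "ab\u03b1\u0314\u0345cd",
            "ab\u03b1\u0345\u0314cd",
            "ab\u1f01\u0345cd",
            "ab\u1fb3\u0314cd",
            "ab\u03b1\u0314\u0345\u0313cd", // extra mark forces one backtracking step
        };
        for (String in : inputs) {
            System.out.println("matched up to index " + matchClass(in, 2, RANGE));
        }
    }
}
```

In the real engine the backtracking is also driven by whether the rest of the pattern matches after the cluster; the sketch only looks at the character class itself.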

The approach seems to work and fixes most of the trouble we have with character
classes.


Here is the webrev (on top of the changes for JDK-8151481 [1])

http://cr.openjdk.java.net/~sherman/regexCE/webrev/

While the character class CANON_EQ support is the main issue we want to address
this time, there are also a couple of other problems reported in the past; they
are fixed together here.

(1) the current implementation only supports base+non_spacing_marks CANON_EQ;
the canonical equivalence of decomposed Hangul (jamos) and composed Hangul
syllables is NOT supported, for example "\u1100\u1161" vs "\uac00".

fixed.

(2) a character in the Unicode composition exclusion table does not match itself,
as reported in JDK-6736245. (special composition sample:
nfd(\u2adc) -> \u2add\u0338
nfc(\u2add\u0338) -> \u2add\u0338 (NOT back to \u2adc))

fixed.

(3) regex compile-time syntax error/exception when compiling certain regexes, for
example "(\u00e9)" triggers

Exception in thread "main" java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 11
((?:é)|é)|e)́)
^
fixed.

(4) the canonical equivalence does not work for the property class:
"\\p{IsGreek}" matches "\u1f80"
"\\p{IsGreek}" but does NOT match "\u1f00\u0345\u0300"

works as expected now.
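The normalization facts behind items (1) and (2) above can be checked directly with java.text.Normalizer. The snippet below is only an illustrative sketch (class and helper names are mine); it shows why a scheme built purely on "base + non-spacing marks" and on NFC round-tripping misses these two cases.

```java
import java.text.Normalizer;

class NormalizationFacts {
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Formats a string as a space-separated list of code points, e.g. "U+AC00".
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Item (1): two Hangul jamos compose to one syllable, yet neither is a non-spacing mark.
        System.out.println("NFC(U+1100 U+1161) = " + hex(nfc("\u1100\u1161")));
        System.out.println("NFD(U+AC00)        = " + hex(nfd("\uac00")));

        // Item (2): U+2ADC is a composition exclusion, so NFC does not restore it.
        System.out.println("NFD(U+2ADC)        = " + hex(nfd("\u2adc")));
        System.out.println("NFC(U+2ADD U+0338) = " + hex(nfc("\u2add\u0338")));
    }
}
```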

thanks,
Sherman

[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039269.html
[2] http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039404.html

[3] http://openjdk.java.net/jeps/111
[4] http://unicode.org/reports/tr18/#Canonical_Equivalents
