Hi Sherman,
A few comments:
src/java.base/share/classes/java/util/regex/CharPredicates.java:
- The static fields like ALPHABETIC should be final
- line: 139/140 the posix and uprops HashMaps could have an initializer
for size to avoid resizing (12 and 18):
- 224: typo: "categoreis"
- 232: double space: "return props.get"
- Remove debugging prints to "PrintPattern.pmap..." before commit
- 288: can you resort "Pf" and "Pi" to be with the other P<x> entries;
maybe sort the whole list.
- 305: Is the definition of "C" missing the equivalent of "Cn"
unassigned or is that an intentional difference?
-323: "LD" the definition of LD is not in the reference (is it
specified somewhere?)
-333: the referenced spec defines the names of the character classes but
not the contents; are they elsewhere?
-374: add "final" to static declarations
Pattern.java:
-2651: stray trailing "{" after comment
-2746: indentation of arguments
-5493: javadoc the description of Predicate interface and implementations
-5637: if you want it to be the first initializer put it at the top of
the file, not buried at the end.
and make them final.
That's a start, I need to look a bit more closely at the algorithmic
change that prompted the fix.
Thanks, Roger
On 3/10/2016 1:55 AM, Xueming Shen wrote:
Now we have a issue id
issue: https://bugs.openjdk.java.net/browse/JDK-8151481
webrev: http://cr.openjdk.java.net/~sherman/8151481/webrev
thanks,
sherman
btw, the patter inside [...] no longer be "printable", they are
"function" now.
On 3/8/16 9:45 AM, Xueming Shen wrote:
Hi,
While waiting patiently for someone to help review the proposal for
the exponential
backtracking issue [1] I went ahead replacing those "CharProperty
constant nodes" with
the IntPredicate. We were hoping having closure back then when
working on those
CharProperty classes, which ended up with those make()/clone(). Now
it might be the
time to replace it with what we wanted at the beginning.
http://cr.openjdk.java.net/~sherman/regexClosure/webrev.02/
Here are the notes about the changes
(1) pulled out the "broken" printNodeTree (for debugging) from the
Pattern. This one does
not work as expected for a while . I do have a working copy and
have to put it in every
time I need debug the engine. So now I replaced the
printNoteTree with working one
and putting it at a separate class j.u.regex.PrintPattern, which
now can print out the
clean and complete node tree of the pattern. For example,
Pattern: [a-z0-9]+|ABCDEFG
0: <Start>
1: <Branch>
2: <CharPropertyGreedy +>
3: <Union>
4: <Range[a-z]>
5: <Range[0-9]>
<-branch.separator->
6: <Slice "ABCDEFG">
7: </Branch>
8: <END>
(2) the optimization for the greedy repetition of a "CharProperty",
which parse the
greedy repetition on a single "CharProperty", such as
\p{IsGreek}+, or the most
commonly used .* into a single/smooth loop node.
from
Pattern: \p{IsGreek}+
0: <Start>
1: <Curly GREEDY + >
2: <Script GREEK>
</Curly>
3: <END>
to
Pattern: \p{IsGreek}+
0: <Start>
1: <CharPropertyGreedy Script GREEK+>
2: <END>
The simple jmh benchmark [2] indicates it is about 50%+,
especially for those no-match
case.
(3) the optimization for the "union" of various individual "char"
inside a chracter class
[...], usch as. [ABCDEF]. For a regex like [a-zABCDEF], now the
engine generates
the nodes like
Pattern: [a-zABCDEF]
0: <Start>
1: <Union>
2: <Union>
3: <Union>
4: <Union>
5: <Union>
6: <Union>
7: <Range[a-z]>
8: <Bits [ A B C D E F]>
8: <Bits [ A B C D E F]>
8: <Bits [ A B C D E F]>
8: <Bits [ A B C D E F]>
8: <Bits [ A B C D E F]>
8: <Bits [ A B C D E F]>
9: <END>
with the optimization it generate (which it should)
Pattern: [a-zABCDEF]
0: <Start>
1: <Union>
2: <Range[a-z]>
3: <Bits [ A B C D E F]>
4: <END>
The jmh benchmark [2] also indicates it is much faster, especially
for those no-match
case.
(4) replaced those "constant" CharProperty nodes with IntPredicate
(the main change :-)).
if have time I might go further up to replace the
""CharProperty.isSatisfiedBy" with a function
"predicate" directly ... in "class" case, we actually don't
really need a "node", the only thing
we really care about is the "predicate" and their combination,
only one node for each "class".
oh, there is another one
(5) I just threw in the change for the "j.u.regex: Negated Character
Classes" [3]
(I can take it out, if anyone has trouble with this)
Thanks,
Sherman
[1]
http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039269.html
[2] http://cr.openjdk.java.net/~sherman/regexClosure/MyBenchmark.java
[3]
http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-June/006957.html