Re: RFR: Regex exponential backtracking issue --- more cleanup/tuning

Roger Riggs Sat, 19 Mar 2016 17:13:22 -0700

Hi Sherman,

A few comments:


src/java.base/share/classes/java/util/regex/CharPredicates.java:

- The static fields like ALPHABETIC should be final

- line: 139/140 the posix and uprops HashMaps could have an initializerfor size to avoid resizing (12 and 18):

- 224: typo: "categoreis"
- 232: double space: "return  props.get"

- Remove debugging prints to "PrintPattern.pmap..." before commit

- 288: can you resort "Pf" and "Pi" to be with the other P<x> entries;maybe sort the whole list.

- 305: Is the definition of "C" missing the equivalent of "Cn"unassigned or is that an intentional difference?

-323: "LD" the definition of LD is not in the reference (is itspecified somewhere?)-333: the referenced spec defines the names of the character classes butnot the contents; are they elsewhere?

-374: add "final" to static declarations

Pattern.java:
-2651: stray trailing "{"  after comment
-2746: indentation of arguments
-5493: javadoc the description of Predicate interface and implementations

-5637: if you want it to be the first initializer put it at the top ofthe file, not buried at the end.

   and make them final.

That's a start, I need to look a bit more closely at the algorithmicchange that prompted the fix.


Thanks, Roger


On 3/10/2016 1:55 AM, Xueming Shen wrote:

Now we have a issue id

issue: https://bugs.openjdk.java.net/browse/JDK-8151481
webrev: http://cr.openjdk.java.net/~sherman/8151481/webrev

thanks,
sherman
btw, the patter inside [...] no longer be "printable", they are"function" now.
On 3/8/16 9:45 AM, Xueming Shen wrote:
Hi,
While waiting patiently for someone to help review the proposal forthe exponentialbacktracking issue [1] I went ahead replacing those "CharPropertyconstant nodes" withthe IntPredicate. We were hoping having closure back then whenworking on thoseCharProperty classes, which ended up with those make()/clone(). Nowit might be the
time to replace it with what we wanted at the beginning.

http://cr.openjdk.java.net/~sherman/regexClosure/webrev.02/

Here are the notes about the changes
(1) pulled out the "broken" printNodeTree (for debugging) from thePattern. This one doesnot work as expected for a while . I do have a working copy andhave to put it in everytime I need debug the engine. So now I replaced theprintNoteTree with working oneand putting it at a separate class j.u.regex.PrintPattern, whichnow can print out the
     clean and complete node tree of the pattern. For example,

   Pattern: [a-z0-9]+|ABCDEFG
     0:  <Start>
     1:  <Branch>
     2:    <CharPropertyGreedy +>
     3:      <Union>
     4:        <Range[a-z]>
     5:        <Range[0-9]>
         <-branch.separator->
     6:    <Slice  "ABCDEFG">
     7:  </Branch>
     8:  <END>
(2) the optimization for the greedy repetition of a "CharProperty",which parse thegreedy repetition on a single "CharProperty", such as\p{IsGreek}+, or the most
     commonly used .* into a single/smooth loop node.

from

    Pattern: \p{IsGreek}+
     0:  <Start>
     1:  <Curly GREEDY  + >
     2:    <Script GREEK>
         </Curly>
     3:  <END>

to

     Pattern: \p{IsGreek}+
     0:  <Start>
     1:  <CharPropertyGreedy Script GREEK+>
     2:  <END>
The simple jmh benchmark [2] indicates it is about 50%+,especially for those no-match
   case.
(3) the optimization for the "union" of various individual "char"inside a chracter class[...], usch as. [ABCDEF]. For a regex like [a-zABCDEF], now theengine generates
     the nodes like

   Pattern: [a-zABCDEF]
     0:  <Start>
     1:  <Union>
     2:    <Union>
     3:      <Union>
     4:        <Union>
     5:          <Union>
     6:            <Union>
     7:              <Range[a-z]>
     8:              <Bits [ A B C D E F]>
     8:            <Bits [ A B C D E F]>
     8:          <Bits [ A B C D E F]>
     8:        <Bits [ A B C D E F]>
     8:      <Bits [ A B C D E F]>
     8:    <Bits [ A B C D E F]>
     9:  <END>

with the optimization it generate (which it should)

   Pattern: [a-zABCDEF]
     0:  <Start>
     1:  <Union>
     2:    <Range[a-z]>
     3:    <Bits [ A B C D E F]>
     4:  <END>
The jmh benchmark [2] also indicates it is much faster, especiallyfor those no-match
   case.
(4) replaced those "constant" CharProperty nodes with IntPredicate(the main change :-)).
if have time I might go further up to replace the""CharProperty.isSatisfiedBy" with a function"predicate" directly ... in "class" case, we actually don'treally need a "node", the only thingwe really care about is the "predicate" and their combination,only one node for each "class".
oh, there is another one
(5) I just threw in the change for the "j.u.regex: Negated CharacterClasses" [3]
     (I can take it out, if anyone has trouble with this)

Thanks,
Sherman
[1]http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039269.html
[2] http://cr.openjdk.java.net/~sherman/regexClosure/MyBenchmark.java
[3]http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-June/006957.html

Re: RFR: Regex exponential backtracking issue --- more cleanup/tuning

Reply via email to