Hi Sherman,

A few comments:

src/java.base/share/classes/java/util/regex/CharPredicates.java:

- The static fields like ALPHABETIC should be final
- line: 139/140 the posix and uprops HashMaps could have an initializer for size to avoid resizing (12 and 18):
- 224: typo: "categoreis"
- 232: double space: "return  props.get"

- Remove debugging prints to "PrintPattern.pmap..." before commit
- 288: can you resort "Pf" and "Pi" to be with the other P<x> entries; maybe sort the whole list.

- 305: Is the definition of "C" missing the equivalent of "Cn" unassigned or is that an intentional difference?

-323: "LD" the definition of LD is not in the reference (is it specified somewhere?) -333: the referenced spec defines the names of the character classes but not the contents; are they elsewhere?
-374: add "final" to static declarations

Pattern.java:
-2651: stray trailing "{"  after comment
-2746: indentation of arguments
-5493: javadoc the description of Predicate interface and implementations
-5637: if you want it to be the first initializer put it at the top of the file, not buried at the end.
   and make them final.

That's a start, I need to look a bit more closely at the algorithmic change that prompted the fix.

Thanks, Roger


On 3/10/2016 1:55 AM, Xueming Shen wrote:
Now we have a issue id

issue: https://bugs.openjdk.java.net/browse/JDK-8151481
webrev: http://cr.openjdk.java.net/~sherman/8151481/webrev

thanks,
sherman

btw, the patter inside [...] no longer be "printable", they are "function" now.

On 3/8/16 9:45 AM, Xueming Shen wrote:
Hi,

While waiting patiently for someone to help review the proposal for the exponential backtracking issue [1] I went ahead replacing those "CharProperty constant nodes" with the IntPredicate. We were hoping having closure back then when working on those CharProperty classes, which ended up with those make()/clone(). Now it might be the
time to replace it with what we wanted at the beginning.

http://cr.openjdk.java.net/~sherman/regexClosure/webrev.02/

Here are the notes about the changes

(1) pulled out the "broken" printNodeTree (for debugging) from the Pattern. This one does not work as expected for a while . I do have a working copy and have to put it in every time I need debug the engine. So now I replaced the printNoteTree with working one and putting it at a separate class j.u.regex.PrintPattern, which now can print out the
     clean and complete node tree of the pattern. For example,

   Pattern: [a-z0-9]+|ABCDEFG
     0:  <Start>
     1:  <Branch>
     2:    <CharPropertyGreedy +>
     3:      <Union>
     4:        <Range[a-z]>
     5:        <Range[0-9]>
         <-branch.separator->
     6:    <Slice  "ABCDEFG">
     7:  </Branch>
     8:  <END>

(2) the optimization for the greedy repetition of a "CharProperty", which parse the greedy repetition on a single "CharProperty", such as \p{IsGreek}+, or the most
     commonly used .* into a single/smooth loop node.

from

    Pattern: \p{IsGreek}+
     0:  <Start>
     1:  <Curly GREEDY  + >
     2:    <Script GREEK>
         </Curly>
     3:  <END>

to

     Pattern: \p{IsGreek}+
     0:  <Start>
     1:  <CharPropertyGreedy Script GREEK+>
     2:  <END>

The simple jmh benchmark [2] indicates it is about 50%+, especially for those no-match
   case.

(3) the optimization for the "union" of various individual "char" inside a chracter class [...], usch as. [ABCDEF]. For a regex like [a-zABCDEF], now the engine generates
     the nodes like

   Pattern: [a-zABCDEF]
     0:  <Start>
     1:  <Union>
     2:    <Union>
     3:      <Union>
     4:        <Union>
     5:          <Union>
     6:            <Union>
     7:              <Range[a-z]>
     8:              <Bits [ A B C D E F]>
     8:            <Bits [ A B C D E F]>
     8:          <Bits [ A B C D E F]>
     8:        <Bits [ A B C D E F]>
     8:      <Bits [ A B C D E F]>
     8:    <Bits [ A B C D E F]>
     9:  <END>

with the optimization it generate (which it should)

   Pattern: [a-zABCDEF]
     0:  <Start>
     1:  <Union>
     2:    <Range[a-z]>
     3:    <Bits [ A B C D E F]>
     4:  <END>

The jmh benchmark [2] also indicates it is much faster, especially for those no-match
   case.

(4) replaced those "constant" CharProperty nodes with IntPredicate (the main change :-)).

if have time I might go further up to replace the ""CharProperty.isSatisfiedBy" with a function "predicate" directly ... in "class" case, we actually don't really need a "node", the only thing we really care about is the "predicate" and their combination, only one node for each "class".

oh, there is another one
(5) I just threw in the change for the "j.u.regex: Negated Character Classes" [3]
     (I can take it out, if anyone has trouble with this)

Thanks,
Sherman

[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039269.html
[2] http://cr.openjdk.java.net/~sherman/regexClosure/MyBenchmark.java
[3] http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-June/006957.html


Reply via email to