[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15522355#comment-15522355 ]
Dawid Weiss commented on LUCENE-7465:
-------------------------------------

Interesting that it's faster than PatternTokenizer! I haven't looked at the patch, Mike, but I did some experiments recently with regexp benchmarking (for our internal needs), using fairly large regular expression patterns over even larger inputs. The native Java Pattern implementation always won by a large (and I mean: super large) margin over anything else I tried. I tried brics, re2 (Java port), re2 (native implementation), and Apache ORO (out of curiosity only; it didn't pass correctness tests for me).

Brics wasn't too bad, but the gain from early detection of "too hard" DFA expressions was overshadowed by DFA expansion (very large automata in our case), so unless you don't have control over the patterns (in which case adversarial inputs can be fed in), it didn't make sense for me to switch.

Also, the fact that the Java implementation was fast was quite surprising to me, as we had a large number of alternatives in our regular expressions and I thought these would yield nicely to automaton optimizations (pull-up of prefix matching, etc.). In the end, it didn't seem to matter. So perhaps the performance is a factor of how complex the regular expressions are (and how they're benchmarked)? Don't know.
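For reference, the alternation-heavy shape of pattern described above can be exercised with the stock JDK engine along these lines (a minimal sketch; the pattern contents and inputs are invented for illustration, not the actual benchmark patterns):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationDemo {
    public static void main(String[] args) {
        // Build a pattern with many alternatives, similar in shape to
        // the expressions discussed above (names are made up here).
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            if (i > 0) sb.append('|');
            sb.append("token").append(i);
        }
        // Compile once up front; the per-match cost is then only the scan.
        Pattern p = Pattern.compile(sb.toString());

        Matcher m = p.matcher("token999");
        System.out.println(m.matches());   // true
        m.reset("token1000");              // no such alternative; prefix
        System.out.println(m.matches());   // alone doesn't count: false
    }
}
```

The interesting question above is whether an automaton-based engine can beat this by merging the shared "token" prefix of all alternatives into one path; per the comment, in practice it didn't seem to matter.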
> Add a PatternTokenizer that uses Lucene's RegExp implementation
> ---------------------------------------------------------------
>
>                 Key: LUCENE-7465
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7465
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0), 6.3
>
>         Attachments: LUCENE-7465.patch, LUCENE-7465.patch
>
> I think there are some nice benefits to a version of PatternTokenizer that uses Lucene's RegExp impl instead of the JDK's:
> * Lucene's RegExp is compiled to a DFA up front, so if a "too hard" RegExp is attempted the user discovers it up front instead of later on when a "lucky" document arrives
> * It processes the incoming characters as a stream, only pulling 128 characters at a time, vs the existing {{PatternTokenizer}} which currently reads the entire string up front (this has caused heap problems in the past)
> * It should be fast.
> I named it {{SimplePatternTokenizer}}, and it still needs a factory and improved tests, but I think it's otherwise close.
> It currently does not take a {{group}} parameter because Lucene's RegExps don't yet implement sub-group capture. I think we could add that at some point, but it's a bit tricky.
> This doesn't even have group=-1 support (like String.split) ... I think if we did that we should maybe name it differently ({{SimplePatternSplitTokenizer}}?).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
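The streaming behavior the issue describes (compile the pattern up front, then pull small character buffers rather than reading the whole input) can be sketched roughly like this. This is a toy illustration only, not Lucene's actual code: the class and method names are invented, and the "DFA" is hard-wired to a single accepting loop state equivalent to the pattern [a-z]+.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Toy sketch: emit maximal runs matching [a-z]+ while reading the input
 *  through a 128-char buffer, never materializing the whole string. */
public class StreamingRunTokenizer {
    public static List<String> tokenize(Reader in) throws IOException {
        List<String> tokens = new ArrayList<>();
        char[] buf = new char[128];          // pull 128 chars at a time
        StringBuilder cur = new StringBuilder();
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                char c = buf[i];
                if (c >= 'a' && c <= 'z') {  // stay in the accepting state
                    cur.append(c);
                } else if (cur.length() > 0) {
                    tokens.add(cur.toString()); // left the match: emit token
                    cur.setLength(0);
                }
            }
        }
        if (cur.length() > 0) tokens.add(cur.toString()); // trailing token
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenize(new StringReader("foo bar, baz")));
        // [foo, bar, baz]
    }
}
```

Because the matcher only ever looks at the current character and its current state, heap use stays bounded by the buffer plus the token being built, regardless of input length — the property the issue contrasts with the existing PatternTokenizer's read-everything-up-front behavior.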