[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation

Michael McCandless (JIRA) Mon, 26 Sep 2016 05:53:33 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15522972#comment-15522972
 ]


Michael McCandless commented on LUCENE-7465:
--------------------------------------------

[~dawid.weiss] is this a benchmark I could try to run?  My regexp was 
admittedly trivial so it would be nice to have a beefier real-world regexp to 
play with ;)

The bench is also trivial (I pushed it to luceneutil).

When you tested dk.brics, did you call the {{RunAutomaton.setAlphabet}}?  This 
should be a biggish speedup, especially if your regexp has many unique 
character start/end ranges.  In Lucene's fork of dk.briks we automatically do 
that in the utf8 case, and with this patch, also for the first 256 unicode 
characters in the full character case.

> Add a PatternTokenizer that uses Lucene's RegExp implementation
> ---------------------------------------------------------------
>
>                 Key: LUCENE-7465
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7465
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0), 6.3
>
>         Attachments: LUCENE-7465.patch, LUCENE-7465.patch
>
>
> I think there are some nice benefits to a version of PatternTokenizer that 
> uses Lucene's RegExp impl instead of the JDK's:
>   * Lucene's RegExp is compiled to a DFA up front, so if a "too hard" RegExp 
> is attempted the user discovers it up front instead of later on when a 
> "lucky" document arrives
>   * It processes the incoming characters as a stream, only pulling 128 
> characters at a time, vs the existing {{PatternTokenizer}} which currently 
> reads the entire string up front (this has caused heap problems in the past)
>   * It should be fast.
> I named it {{SimplePatternTokenizer}}, and it still needs a factory and 
> improved tests, but I think it's otherwise close.
> It currently does not take a {{group}} parameter because Lucene's RegExps 
> don't yet implement sub group capture.  I think we could add that at some 
> point, but it's a bit tricky.
> This doesn't even have group=-1 support (like String.split) ... I think if we 
> did that we should maybe name it differently 
> ({{SimplePatternSplitTokenizer}}?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation

Reply via email to