[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation

David Smiley (JIRA) Fri, 27 Jan 2017 05:32:50 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15842872#comment-15842872
 ]


David Smiley commented on LUCENE-7465:
--------------------------------------

bq. (Adrien) I like the separate factory idea better, it makes it easier to 
evolve those two impls separately, eg. in the case that we decide to deprecate 
PatternTokenizer or to move it to sandbox.

I think the factory isn't going to stand in the way of either tokenizer 
evolving.  A problem with separate factories is that the name 
{{PatternTokenizerFactory}} is already an excellent name, nor does it have 
hints as to how it works.  In general I don't like polluting the namespace with 
different implementations of effectively the same thing; the first impl to show 
up grabs the best name.  The Factory provides an excellent opportunity to 
bridge these multiple implementations.

Yet alas, my arguments aren't swaying anyone so go ahead.

> Add a PatternTokenizer that uses Lucene's RegExp implementation
> ---------------------------------------------------------------
>
>                 Key: LUCENE-7465
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7465
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0), 6.4
>
>         Attachments: LUCENE-7465.patch, LUCENE-7465.patch
>
>
> I think there are some nice benefits to a version of PatternTokenizer that 
> uses Lucene's RegExp impl instead of the JDK's:
>   * Lucene's RegExp is compiled to a DFA up front, so if a "too hard" RegExp 
> is attempted the user discovers it up front instead of later on when a 
> "lucky" document arrives
>   * It processes the incoming characters as a stream, only pulling 128 
> characters at a time, vs the existing {{PatternTokenizer}} which currently 
> reads the entire string up front (this has caused heap problems in the past)
>   * It should be fast.
> I named it {{SimplePatternTokenizer}}, and it still needs a factory and 
> improved tests, but I think it's otherwise close.
> It currently does not take a {{group}} parameter because Lucene's RegExps 
> don't yet implement sub group capture.  I think we could add that at some 
> point, but it's a bit tricky.
> This doesn't even have group=-1 support (like String.split) ... I think if we 
> did that we should maybe name it differently 
> ({{SimplePatternSplitTokenizer}}?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation

Reply via email to