[jira] [Commented] (LUCENE-8102) CompiledAutomaton performance for determining common suffix

Robert Muir (JIRA) Mon, 18 Dec 2017 05:49:36 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295003#comment-16295003
 ]


Robert Muir commented on LUCENE-8102:
-------------------------------------

The optimization speeds up leading wildcard type-queries (such as wildcard 
"*foo" or regex ".*foo") which are far more common e.g. on the users list than 
the example presented on this issue, so I really think we should keep it. 

The regex here is quite obscene, maybe its the thing that should be optimized? 
:)


> CompiledAutomaton performance for determining common suffix
> -----------------------------------------------------------
>
>                 Key: LUCENE-8102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8102
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>    Affects Versions: 7.1
>            Reporter: Thomas Poppe
>
> We're using the automaton package as part of Elasticsearch for doing regexp 
> queries.  Our business requires us to process rather complex regular 
> expressions, for example (we have more complex examples, but this one 
> illustrates the problem):
> {noformat}
>         (¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,10}c(¦.)*(¦?[^¦]){1,10}d
> {noformat}
> With a large enough value of maxDeterminizedStates, this works.  The problem 
> we're having is that the conversion of this regular expression to a 
> CompiledAutomaton takes very long.  Almost all of the time goes into 
> determining the common suffix for the Automaton (which is "d" in this 
> example) - calculated with a call to Operations.getCommonSuffixBytesRef.
> This suffix is only used as an optimization.  Skipping the calculation of 
> this suffix allows us to process these kinds of queries.
> - Would it be possible to introduce a way to skip the calculation of this 
> common suffix (ideally something we control from within our query to 
> Elasticsearch)?
> - Or would it be possible to take a look at this getCommonSuffixBytesRef 
> operation, to see if it can be optimized?  Most of the time goes to 
> determinizing the reversed automaton - maybe this can be avoided somehow?
> Reaction from Mike McCandless on the mailing list:
> This is just an optimization; maybe we should expose an option to disable it?
> Or maybe we can find the common suffix on an NFA instead, to avoid 
> determinization?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8102) CompiledAutomaton performance for determining common suffix

Reply via email to