[
https://issues.apache.org/jira/browse/LUCENE-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432377#comment-15432377
]
Uwe Schindler edited comment on LUCENE-7411 at 8/23/16 8:33 AM:
----------------------------------------------------------------
I just looked at the query code. Unfortunately, unlike Lucene's own
AutomatonTermsEnum, the TermsEnum there cannot seek through the terms based on
the automaton structure. So if you execute the query, it will match the terms
correctly, but it has to iterate over every term in the index, and there can
be many of them:
https://github.com/s4ke/moar/blob/master/lucene/src/main/java/com/github/s4ke/moar/lucene/query/MoarQuery.java#L65-L83
In contrast, AutomatonTermsEnum
(https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java)
can seek based on the automaton structure, which lets it "guess" the next
candidate term that could match. For example, with an automaton like
"(http|ftp)://.*", the query does not iterate over terms that can never match
(such as those not starting with "h" or "f"); it seeks past them. It can do
this because the terms index can itself be viewed as an automaton (a sorted
list of terms).
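
To illustrate the difference, here is a minimal sketch in plain Java. It is
not Lucene code: the class and method names are hypothetical, a TreeSet stands
in for the sorted terms dictionary, and the automaton "(http|ftp)://.*" is
reduced by hand to its two literal prefixes. The real AutomatonTermsEnum works
on byte sequences and automaton states, so treat this only as a picture of the
idea of seeking versus scanning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only; not part of Lucene or of the moar project.
class SeekingEnumSketch {

    /** Matched terms plus the number of terms that had to be examined. */
    static final class Result {
        final List<String> matches = new ArrayList<>();
        int visited = 0;
    }

    /** Baseline: examine every term in the dictionary, as a plain scan does. */
    static Result scanAll(TreeSet<String> terms, List<String> prefixes) {
        Result r = new Result();
        for (String term : terms) {
            r.visited++;
            for (String p : prefixes) {
                if (term.startsWith(p)) {
                    r.matches.add(term);
                    break;
                }
            }
        }
        return r;
    }

    /**
     * Seeking variant: for each prefix, jump to the first term >= prefix and
     * stop at the first term that no longer carries it. This relies on the
     * terms being sorted, so all terms sharing a prefix are contiguous.
     * Assumes the prefixes are sorted and none is a prefix of another.
     */
    static Result seek(TreeSet<String> terms, List<String> sortedPrefixes) {
        Result r = new Result();
        for (String p : sortedPrefixes) {
            String term = terms.ceiling(p);       // the "seek"
            while (term != null) {
                r.visited++;
                if (!term.startsWith(p)) {
                    break;                        // left the prefix range
                }
                r.matches.add(term);
                term = terms.higher(term);        // next term in sorted order
            }
        }
        return r;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(List.of(
            "apple", "banana", "ftp://a", "ftp://b", "gopher://x",
            "http://a", "https://b", "zebra"));
        List<String> prefixes = List.of("ftp://", "http://");
        Result full = scanAll(terms, prefixes);
        Result seeking = seek(terms, prefixes);
        System.out.println("scan: " + full.matches + " visited=" + full.visited);
        System.out.println("seek: " + seeking.matches + " visited=" + seeking.visited);
    }
}
```

Both methods return the same matches, but the seeking variant only touches the
terms inside each prefix range plus one term past its end, so its cost does
not grow with the number of unrelated terms in the index.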
> Regex Query with Backreferences
> -------------------------------
>
> Key: LUCENE-7411
> URL: https://issues.apache.org/jira/browse/LUCENE-7411
> Project: Lucene - Core
> Issue Type: New Feature
> Components: core/search
> Reporter: Martin Braun
> Priority: Minor
>
> Hi there,
> I am currently working on a Regex Engine that supports Backreferences while
> not losing determinism. It uses Memory Occurrence Automata (MOAs) in the
> engine which are more powerful than normal DFA/NFAs. The engine does no
> backtracking and recognizes Regexes that cannot be evaluated
> deterministically as malformed. It has become more and more mature in the
> last few weeks and I also implemented a Lucene Query that uses these Patterns
> in the background. Now my question is: Is there any interest for this work to
> be merged (or adapted) into Lucene core?
> https://github.com/s4ke/moar
> Usage example for the Lucene Query:
> https://github.com/s4ke/moar/blob/master/lucene/src/test/java/com/github/s4ke/moar/lucene/query/test/MoarQueryTest.java#L126
> Cheers,
> Martin