[
https://issues.apache.org/jira/browse/LUCENE-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432352#comment-15432352
]
Martin Braun edited comment on LUCENE-7411 at 8/23/16 8:12 AM:
---------------------------------------------------------------
MOAs are a different type of automaton. Their expressive power is greater than
the regular theoretical model of NFAs/DFAs.
NFA/DFA Regex engines usually (as far as I know) either build a NFA and run it,
build a NFA and construct the DFA, or simulate the NFA implicitly. Then there
are the standard Java Regexes which don't really rely on any NFA/DFA logic and
also do some special snoflakey stuff that Perl once did in special cases.
Now this is the dilemma: For some applications (like Lucene) you might want to
allow Backreferences in your Regexes. But normal Perl Regexes are a beast to
handle as matching with them is a NP hard problem (http://perl.plover.com/NPC/)
and you probably don't want your index query to take extremely long and then
yielding only some few results. The theoretical work my supervisor (it is him
and a colleague of his that write the theoretical paper) does is looking into
whether it is possible to keep some of the benefits of Backreferences by
restricting the capabilities and still having more expressive power than
regular NFA/DFA Regexes. Restriction means for example that Regexes like
(?<x>a*)\1 are not allowed as we cannot tell whether we should match the next a
to the group or not. But we allow stuff like (?<x>a*)|\1 because we know that
we can safely add the next a to the matching group as well.
Also at the moment we don't differentiate between lazy or greedy group
matching. If we can't tell what part of a regex we have to match the next
character to, we consider this a non deterministic regex (The definition and
handling of these cases in the library might change if people really need them).
This Query type is also not recommended as a drop in replacement for Lucenes
Regexp Query, it is only meant for the people that really need backreference
support in their regexes.
was (Author: s4ke):
MOAs are a different way to handle automata. Their expressive power is greater
than the regular theoretical model of NFAs/DFAs.
NFA/DFA Regex engines usually (as far as I know) either build a NFA and run it,
build a NFA and construct the DFA, or simulate the NFA implicitly. Then there
are the standard Java Regexes which don't really rely on any NFA/DFA logic and
also do some special snoflakey stuff that Perl once did in special cases.
Now this is the dilemma: For some applications (like Lucene) you might want to
allow Backreferences in your Regexes. But normal Perl Regexes are a beast to
handle as matching with them is a NP hard problem (http://perl.plover.com/NPC/)
and you probably don't want your index query to take extremely long and then
yielding only some few results. The theoretical work my supervisor (it is him
and a colleague of his that write the theoretical paper) does is looking into
whether it is possible to keep some of the benefits of Backreferences by
restricting the capabilities and still having more expressive power than
regular NFA/DFA Regexes. Restriction means for example that Regexes like
(?<x>a*)\1 are not allowed as we cannot tell whether we should match the next a
to the group or not. But we allow stuff like (?<x>a*)|\1 because we know that
we can safely add the next a to the matching group as well.
Also at the moment we don't differentiate between lazy or greedy group
matching. If we can't tell what part of a regex we have to match the next
character to, we consider this a non deterministic regex (The definition and
handling of these cases in the library might change if people really need them).
This Query type is also not recommended as a drop in replacement for Lucenes
Regexp Query, it is only meant for the people that really need backreference
support in their regexes.
> Regex Query with Backreferences
> -------------------------------
>
> Key: LUCENE-7411
> URL: https://issues.apache.org/jira/browse/LUCENE-7411
> Project: Lucene - Core
> Issue Type: New Feature
> Components: core/search
> Reporter: Martin Braun
> Priority: Minor
>
> Hi there,
> I am currently working on a Regex Engine that supports Backreferences while
> not losing determinism. It uses Memory Occurence Automata (MOAs) in the
> engine which are more powerful than normal DFA/NFAs. The engine does no
> backtracking and recognizes Regexes that cannot be evaluated
> deterministically as malformed. It has become more and more mature in the
> last few weeks and I also implemented a Lucene Query that uses these Patterns
> in the background. Now my question is: Is there any interest for this work to
> be merged (or adapted) into Lucene core?
> https://github.com/s4ke/moar
> Usage example for the Lucene Query:
> https://github.com/s4ke/moar/blob/master/lucene/src/test/java/com/github/s4ke/moar/lucene/query/test/MoarQueryTest.java#L126
> Cheers,
> Martin
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]