[ 
https://issues.apache.org/jira/browse/LUCENE-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432352#comment-15432352
 ] 

Martin Braun edited comment on LUCENE-7411 at 8/23/16 8:13 AM:
---------------------------------------------------------------

MOAs are a different type of automaton. Their expressive power is greater than 
the regular theoretical model of NFAs/DFAs. 

NFA/DFA Regex engines usually (as far as I know) either build a NFA and run it, 
build a NFA and construct the DFA, or simulate the NFA implicitly. Then there 
are the standard Java Regexes which don't really rely on any NFA/DFA logic and 
also do some special snoflakey stuff that Perl once did in special cases.

Now this is the dilemma: For some applications (like Lucene) you might want to 
allow Backreferences in your Regexes. But normal Perl Regexes are a beast to 
handle as matching with them is a NP hard problem (http://perl.plover.com/NPC/) 
and you probably don't want your index query to take extremely long and then 
yielding only some few results. The theoretical work my supervisor (it is him 
and a colleague of his that write the theoretical paper) does is looking into 
whether it is possible to keep some of the benefits of Backreferences by 
restricting the capabilities and still having more expressive power than 
regular NFA/DFA Regexes. Restriction means for example that Regexes like 
(?<x>a*)\1 are not allowed as we cannot tell whether we should match the next a 
to the group or not. But we allow stuff like (?<x>a*)|\1 because we know that 
we can safely add the next a to the matching group as well.

Also at the moment we don't differentiate between lazy or greedy group 
matching. If we can't tell what part of a regex we have to match the next 
character to, we consider this a non deterministic regex (The definition and 
handling of these cases in the library might change if people really need them).

This Query type is also not recommended as a drop in replacement for Lucenes 
Regexp Query, it is only meant for the people that really need backreference 
support in their regexes.

EDIT:

As of now it is also possible to create the automata by hand as the Regexes are 
not as powerful as the MOAs, but I don't know whether that has a real world 
application and is not only meant for people that want to mess with the 
automaton model.


was (Author: s4ke):
MOAs are a different type of automaton. Their expressive power is greater than 
the regular theoretical model of NFAs/DFAs. 

NFA/DFA Regex engines usually (as far as I know) either build a NFA and run it, 
build a NFA and construct the DFA, or simulate the NFA implicitly. Then there 
are the standard Java Regexes which don't really rely on any NFA/DFA logic and 
also do some special snoflakey stuff that Perl once did in special cases.

Now this is the dilemma: For some applications (like Lucene) you might want to 
allow Backreferences in your Regexes. But normal Perl Regexes are a beast to 
handle as matching with them is a NP hard problem (http://perl.plover.com/NPC/) 
and you probably don't want your index query to take extremely long and then 
yielding only some few results. The theoretical work my supervisor (it is him 
and a colleague of his that write the theoretical paper) does is looking into 
whether it is possible to keep some of the benefits of Backreferences by 
restricting the capabilities and still having more expressive power than 
regular NFA/DFA Regexes. Restriction means for example that Regexes like 
(?<x>a*)\1 are not allowed as we cannot tell whether we should match the next a 
to the group or not. But we allow stuff like (?<x>a*)|\1 because we know that 
we can safely add the next a to the matching group as well.

Also at the moment we don't differentiate between lazy or greedy group 
matching. If we can't tell what part of a regex we have to match the next 
character to, we consider this a non deterministic regex (The definition and 
handling of these cases in the library might change if people really need them).

This Query type is also not recommended as a drop in replacement for Lucenes 
Regexp Query, it is only meant for the people that really need backreference 
support in their regexes.

> Regex Query with Backreferences
> -------------------------------
>
>                 Key: LUCENE-7411
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7411
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/search
>            Reporter: Martin Braun
>            Priority: Minor
>
> Hi there,
> I am currently working on a Regex Engine that supports Backreferences while 
> not losing determinism. It uses Memory Occurence Automata (MOAs) in the 
> engine which are more powerful than normal DFA/NFAs. The engine does no 
> backtracking and recognizes Regexes that cannot be evaluated 
> deterministically as malformed. It has become more and more mature in the 
> last few weeks and I also implemented a Lucene Query that uses these Patterns 
> in the background. Now my question is: Is there any interest for this work to 
> be merged (or adapted) into Lucene core?
> https://github.com/s4ke/moar
> Usage example for the Lucene Query:
> https://github.com/s4ke/moar/blob/master/lucene/src/test/java/com/github/s4ke/moar/lucene/query/test/MoarQueryTest.java#L126
> Cheers,
> Martin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to