[
https://issues.apache.org/jira/browse/LUCENE-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432417#comment-15432417
]
Dawid Weiss commented on LUCENE-7411:
-------------------------------------
Thanks for the clarification, Martin. As for the theoretical part -- I'd be
really curious to see how MOAs are different to automata (in general); unless
you can have some kind of recursive backreferences which would be indeed a
strange, if theoretically interesting, concept!
To me personally it seems like backreferences are a very seldom used feature.
Most people have very superficial (if any) knowledge of regular expressions;
anything going beyond a simple alternative or glob-like match is an "advanced"
scenario. Backreferences would, in my opinion, be of interest to perhaps .1% of
that advanced audience... Not that it isn't interesting -- I love automata
theory -- it's just my personal opinion that including such a feature wouldn't
be of general use. I don't see any problem in you maintaining an add-on module
to Lucene though (for those experts that need such a feature).
Speaking of automata. You mentioned Java's built-in regexes. They indeed don't
transform the input expressions into (direct) automata, but they're really,
really fast on the average case (and yes, exponentially slow in the worst
case). This is by design. Most of the time the difference will be in favor of
"fast on average" if you don't have adversarial scenarios (like users willing
to DOS your service).
I just recently had a look at Russ Cox's re2 because we needed fast regex
matching routine. There is a port of re2 to Java (simplified). The test was a
compound expression consisting of a union of several hundred (or thousands)
different (fairly trivial) regular expressions, matched against thousands of
input sequences. A pattern detection engine, basically. Java's built-in regular
expression engine was an order of magnitude (sic!) faster than anything else
available, including the C version of re2... If you can showcase your engine
to perform at the same level it'd be a splendid (research and practical) result.
> Regex Query with Backreferences
> -------------------------------
>
> Key: LUCENE-7411
> URL: https://issues.apache.org/jira/browse/LUCENE-7411
> Project: Lucene - Core
> Issue Type: New Feature
> Components: core/search
> Reporter: Martin Braun
> Priority: Minor
>
> Hi there,
> I am currently working on a Regex Engine that supports Backreferences while
> not losing determinism. It uses Memory Occurence Automata (MOAs) in the
> engine which are more powerful than normal DFA/NFAs. The engine does no
> backtracking and recognizes Regexes that cannot be evaluated
> deterministically as malformed. It has become more and more mature in the
> last few weeks and I also implemented a Lucene Query that uses these Patterns
> in the background. Now my question is: Is there any interest for this work to
> be merged (or adapted) into Lucene core?
> EDIT:
> The current state is only a mere proof of concept. The performance can
> probably be improved by a lot by adapting concepts of the Lucene Regexp
> Query. As Uwe Schindler correctly stated, the Query currently is quite "dumb"
> as in it doesn't predict what terms to match next.
> https://github.com/s4ke/moar
> Usage example for the Lucene Query:
> https://github.com/s4ke/moar/blob/master/lucene/src/test/java/com/github/s4ke/moar/lucene/query/test/MoarQueryTest.java#L126
> Cheers,
> Martin
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]