[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

Robert Muir (JIRA) Tue, 24 Nov 2009 12:58:03 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782159#action_12782159
 ]


Robert Muir commented on LUCENE-1606:
-------------------------------------

{quote}
I think I'm confused - if the query is ???1234, the common suffix is
1234, and so shouldn't the DFA tell us the next XXX1234 term to try to
seek to (and we should never use next() on the enum)?
{quote}

not really the suffix, but more general, the structure of the dfa itself tells 
us that for ???1234, if you are evaluating 5551234, that the next term should 
really be 5561234
you can tell this by basically 'walking the graph'

but this knowledge is not always available, sometimes we only have 'partial' 
knowledge of where to go next.
The culprit here is the * operator, because it eats up anything :)
So when you walk the graph, the * operator is this giant monster in your way. 

so sometimes, depending on the dfa (which depends on the wildcard or regex used 
to construct it), 
automaton knows enough to seek to all the exact locations, if its finite (like 
???1234)

sometimes, it only knows partial information. when a loop is encountered 
walking the graph (the giant monster), it has to stop and only use the path 
information it knows so far.
for example a wildcard of abcd*1234, the best it can do is seek to abcd.

your description of ping-pong is right, this is how these situations are 
handled.

right now, the way the enum works, is i don't even bother seeking until i hit a 
mismatch.
you can see this in the comments 'as long as there is a match, keep reading. 
this is an optimization when many terms are matched sequentially like ab*'

i tested this along time ago, perhaps we should re-test to see if its 
appropriate?


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
>                 Key: LUCENE-1606
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1606
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>      The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>       
>      1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>      2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

Reply via email to