Re: [jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

Michael McCandless Mon, 15 Jun 2009 09:21:04 -0700

Why do you hate 1.6, Uwe?

Mike


On Mon, Jun 15, 2009 at 12:10 PM, Uwe Schindler (JIRA)<[email protected]> wrote:
>
>     [ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Uwe Schindler updated LUCENE-1606:
> ----------------------------------
>
>    Fix Version/s:     (was: 2.9)
>                   3.0
>
> I am moving this to 3.0 (and not 3.1) because it can be released together 
> with 3.0 (contrib modules do not need to wait until 3.1).
>
> Robert: you could supply a patch with the StringBuilder toString() variants 
> and all those @Override annotations uncommented. It works correctly with 1.5 
> (I am working with 1.5 here locally - I hate 1.6...).
>
>> Automaton Query/Filter (scalable regex)
>> ---------------------------------------
>>
>>                 Key: LUCENE-1606
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-1606
>>             Project: Lucene - Java
>>          Issue Type: New Feature
>>          Components: contrib/*
>>            Reporter: Robert Muir
>>            Assignee: Uwe Schindler
>>            Priority: Minor
>>             Fix For: 3.0
>>
>>         Attachments: automaton.patch, automatonMultiQuery.patch, 
>> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
>> automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch
>>
>>
>> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
>> suitable).
>> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
>> indexes (100M+ unique tokens) where queries are quite slow (2 minutes or so). 
>> Additionally, all of the existing RegexQuery implementations in Lucene are 
>> really slow if there is no constant prefix. This implementation does not 
>> depend on a constant prefix, and runs the same query in 640ms.
>> Some use cases I envision:
>>  1. lexicography/etc on large text corpora
>>  2. looking for things such as urls where the prefix is not constant 
>> (http:// or ftp://)
>> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to 
>> convert regular expressions into a DFA. Then, the filter "enumerates" terms 
>> in a special way, by using the underlying state machine. Here is my short 
>> description from the comments:
>>      The algorithm here is pretty basic. Enumerate terms, but instead of a 
>> binary accept/reject, do:
>>
>>      1. Look at the portion that is OK (did not enter a reject state in the 
>> DFA).
>>      2. Generate the next possible String and seek to that.
>>
>> The Query simply wraps the filter with ConstantScoreQuery.
>> I did not include the automaton.jar inside the patch but it can be 
>> downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

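For readers skimming the thread, the two-step enumeration described in the quoted issue text can be sketched roughly as below. This is only an illustration under assumptions, not the code from the attached patches: the tiny Dfa class, the matchingTerms helper, and the use of a TreeSet as a stand-in for a sorted term dictionary are all hypothetical, and the real patch builds its DFA with the BRICS package rather than by hand. It needs Java 7 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

final class DfaSeekSketch {
    // Hypothetical minimal DFA (not the BRICS API): state 0 is the start
    // state and each state maps a character to its successor state.
    static final class Dfa {
        final List<TreeMap<Character, Integer>> transitions = new ArrayList<>();
        final List<Boolean> accepting = new ArrayList<>();

        int addState(boolean accept) {
            transitions.add(new TreeMap<Character, Integer>());
            accepting.add(accept);
            return transitions.size() - 1;
        }

        void addTransition(int from, char c, int to) {
            transitions.get(from).put(c, to);
        }
    }

    // Walks the sorted terms, but after a rejection seeks to the next string
    // the DFA could still accept instead of stepping to the next term.
    static List<String> matchingTerms(TreeSet<String> terms, Dfa dfa) {
        List<String> hits = new ArrayList<>();
        String term = terms.isEmpty() ? null : terms.first();
        while (term != null) {
            // states[i] is the DFA state after consuming i characters.
            int[] states = new int[term.length() + 1];
            int ok = 0; // step 1: length of the portion that is OK
            while (ok < term.length()) {
                Integer next = dfa.transitions.get(states[ok]).get(term.charAt(ok));
                if (next == null) {
                    break; // no transition: the DFA rejects here
                }
                states[++ok] = next;
            }
            if (ok == term.length()) {
                // Whole term consumed: accept if the final state is accepting.
                if (dfa.accepting.get(states[ok])) {
                    hits.add(term);
                }
                term = terms.higher(term);
                continue;
            }
            // Step 2: keep the OK prefix and replace the offending character
            // with the smallest larger one that has a transition; if there is
            // none, back up one position and try again.
            String target = null;
            for (int i = ok; i >= 0 && target == null; i--) {
                Character c = dfa.transitions.get(states[i]).higherKey(term.charAt(i));
                if (c != null) {
                    target = term.substring(0, i) + c;
                }
            }
            // Seek: first term >= target, or stop if nothing can match anymore.
            term = (target == null) ? null : terms.ceiling(target);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hand-built DFA for the regex [bc]at (no constant prefix).
        Dfa dfa = new Dfa();
        int s0 = dfa.addState(false);
        int s1 = dfa.addState(false);
        int s2 = dfa.addState(false);
        int s3 = dfa.addState(true);
        dfa.addTransition(s0, 'b', s1);
        dfa.addTransition(s0, 'c', s1);
        dfa.addTransition(s1, 'a', s2);
        dfa.addTransition(s2, 't', s3);

        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
                "ant", "bat", "bird", "cat", "cow", "dog", "zebra"));
        System.out.println(matchingTerms(terms, dfa)); // prints [bat, cat]
    }
}
```

In this toy run the loop only ever looks at "ant", "bat", "bird", "cat" and "cow"; after "cow" is rejected there is no larger string the DFA could accept, so "dog" and "zebra" are never visited. That skipping is the point of seeking rather than testing every term.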
