Re: [jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

Ghazal Gharooni Sat, 05 Dec 2009 13:58:50 -0800

Hello,

I am new in the community and I've completely been confused. Please anybody
help me out to know which part of codes you are working with. How should I
participate in work? Thank you!





On Sat, Dec 5, 2009 at 1:02 PM, Uwe Schindler (JIRA) <[email protected]>wrote:

>
>     [
> https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Uwe Schindler updated LUCENE-1606:
> ----------------------------------
>
>     Attachment:     (was: LUCENE-1606-flex.patch)
>
> > Automaton Query/Filter (scalable regex)
> > ---------------------------------------
> >
> >                 Key: LUCENE-1606
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-1606
> >             Project: Lucene - Java
> >          Issue Type: New Feature
> >          Components: Search
> >            Reporter: Robert Muir
> >            Assignee: Robert Muir
> >            Priority: Minor
> >             Fix For: 3.1
> >
> >         Attachments: automaton.patch, automatonMultiQuery.patch,
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch,
> automatonWithWildCard.patch, automatonWithWildCard2.patch,
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606_nodep.patch
> >
> >
> > Attached is a patch for an AutomatonQuery/Filter (name can change if its
> not suitable).
> > Whereas the out-of-box contrib RegexQuery is nice, I have some very large
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc.
> Additionally all of the existing RegexQuery implementations in Lucene are
> really slow if there is no constant prefix. This implementation does not
> depend upon constant prefix, and runs the same query in 640ms.
> > Some use cases I envision:
> >  1. lexicography/etc on large text corpora
> >  2. looking for things such as urls where the prefix is not constant
> (http:// or ftp://)
> > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to
> convert regular expressions into a DFA. Then, the filter "enumerates" terms
> in a special way, by using the underlying state machine. Here is my short
> description from the comments:
> >      The algorithm here is pretty basic. Enumerate terms but instead of a
> binary accept/reject do:
> >
> >      1. Look at the portion that is OK (did not enter a reject state in
> the DFA)
> >      2. Generate the next possible String and seek to that.
> > the Query simply wraps the filter with ConstantScoreQuery.
> > I did not include the automaton.jar inside the patch but it can be
> downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Re: [jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

Reply via email to