[
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782198#action_12782198
]
Robert Muir edited comment on LUCENE-1606 at 11/24/09 10:04 PM:
----------------------------------------------------------------
benchmark results from mike's idea. I don't use any heuristic, just remove the
extra 'next' to show the tradeoffs.
disclaimer: against trunk with LUCENE-2075
||Pattern||Iter||AvgHits||AvgMS||AvgMS (noNext)||
|N?N?N?N|10|1000.0|37.5|28.4|
|?NNNNNN|10|10.0|6.4|6.1|
|??NNNNN|10|100.0|9.6|9.2|
|???NNNN|10|1000.0|52.7|40.9|
|????NNN|10|10000.0|300.7|262.3|
|NN??NNN|10|100.0|4.9|4.1|
|NN?N*|10|10000.0|9.6|28.9|
|?NN*|10|100000.0|80.4|235.4|
|*N|10|1000000.0|3811.6|3747.5|
|*NNNNNN|10|10.0|2098.3|2221.9|
|NNNNN??|10|100.0|2.2|2.4|
Mike my gut feeling, which will require a lot more testing, is that if the
automaton accepts a finite language (in the wildcard case, no *), we should not
do the next() call.
but more benchmarking is needed, with more patterns, especially on flex branch
to determine if this heuristic is best.
was (Author: rcmuir):
benchmark results from mike's idea. I don't use any heuristic, just remove
the extra 'next' to show the tradeoffs.
||Pattern||Iter||AvgHits||AvgMS||AvgMS (noNext)||
|N?N?N?N|10|1000.0|37.5|28.4|
|?NNNNNN|10|10.0|6.4|6.1|
|??NNNNN|10|100.0|9.6|9.2|
|???NNNN|10|1000.0|52.7|40.9|
|????NNN|10|10000.0|300.7|262.3|
|NN??NNN|10|100.0|4.9|4.1|
|NN?N*|10|10000.0|9.6|28.9|
|?NN*|10|100000.0|80.4|235.4|
|*N|10|1000000.0|3811.6|3747.5|
|*NNNNNN|10|10.0|2098.3|2221.9|
|NNNNN??|10|100.0|2.2|2.4|
Mike my gut feeling, which will require a lot more testing, is that if the
automaton accepts a finite language (in the wildcard case, no *), we should not
do the next() call.
but more benchmarking is needed, with more patterns, especially on flex branch
to determine if this heuristic is best.
> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch,
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch,
> automatonWithWildCard.patch, automatonWithWildCard2.patch,
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc.
> Additionally all of the existing RegexQuery implementations in Lucene are
> really slow if there is no constant prefix. This implementation does not
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http://
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert
> regular expressions into a DFA. Then, the filter "enumerates" terms in a
> special way, by using the underlying state machine. Here is my short
> description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a
> binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the
> DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded
> from http://www.brics.dk/automaton/ and is BSD-licensed.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]