[
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Muir updated LUCENE-2089:
--------------------------------
Attachment: LUCENE-2089.patch
updated patch:
* when the user supplies a constant prefix, it is actually used for speeding up
the automaton enum (less seeks). it is simply DFA-concatenated with the Lev DFA
for the non-prefix part.
i will upload a separate patch that improves BasicOperations.concatenate to
allow this without creating an intermediate NFA.
> explore using automaton for fuzzyquery
> --------------------------------------
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
> Issue Type: Wish
> Components: Search
> Affects Versions: Flex Branch
> Reporter: Robert Muir
> Assignee: Mark Miller
> Priority: Minor
> Fix For: Flex Branch
>
> Attachments: LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch,
> LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java
>
>
> Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is
> itching to write that nasty algorithm)
> we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
> * up front, calculate the maximum required K edits needed to match the users
> supplied float threshold.
> * for at least small common E up to some max K (1,2,3, etc) we should create
> a DFA for each E.
> if the required E is above our supported max, we use "dumb mode" at first (no
> seeking, no DFA, just brute force like now).
> As the pq fills, we swap progressively lower DFAs into the enum, based upon
> the lowest score in the pq.
> This should work well on avg, at high E, you will typically fill the pq very
> quickly since you will match many terms.
> This not only provides a mechanism to switch to more efficient DFAs during
> enumeration, but also to switch from "dumb mode" to "smart mode".
> i modified my wildcard benchmark to generate random fuzzy queries.
> * Pattern: 7N stands for NNNNNNN, etc.
> * AvgMS_DFA: this is the time spent creating the automaton (constructor)
> ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
> |7N|10|64.0|4155.9|38.6|20.3|
> |14N|10|0.0|2511.6|46.0|37.9|
> |28N|10|0.0|2506.3|93.0|86.6|
> |56N|10|0.0|2524.5|304.4|298.5|
> as you can see, this prototype is no good yet, because it creates the DFA in
> a slow way. right now it creates an NFA, and all this wasted time is in
> NFA->DFA conversion.
> So, for a very long string, it just gets worse and worse. This has nothing to
> do with lucene, and here you can see, the TermEnum is fast (AvgMS -
> AvgMS_DFA), there is no problem there.
> instead we should just build a DFA to begin with, maybe with this paper:
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
> we can precompute the tables with that algorithm up to some reasonable K, and
> then I think we are ok.
> the paper references using http://portal.acm.org/citation.cfm?id=135907 for
> linear minimization, if someone wants to implement this they should not worry
> about minimization.
> in fact, we need to at some point determine if AutomatonQuery should even
> minimize FSM's at all, or if it is simply enough for them to be deterministic
> with no transitions to dead states. (The only code that actually assumes
> minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a
> summation easily). we need to benchmark really complex DFAs (i.e. write a
> regex benchmark) to figure out if minimization is even helping right now.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]