[
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790368#action_12790368
]
Mark Miller commented on LUCENE-2089:
-------------------------------------
bq. If you do take hold of it, do not hesitate to share The original paper and
C++ code likewise melt my brain, and I needed the algo in some other place.
The java impl I was onto was about 75% complete according to the author, but I
have not yet looked at the code. Robert was convinced it was a different less
efficient algorithm last I heard though.
We have cracked much of the paper - thats how Robert implemented n=1 here -
thats from the paper. The next step is to work out how to construct the tables
for n as Robert says above. And store those tables efficiently as they start
getting quite large rather fast - though we might only use as high as n=3 or 4
in Lucene - Robert suspects term seeking will outweigh any gains at that point.
I think we know how to do the majority of the work for the n case, but I don't
really have much/any time for this, so it probably depends on if/when Robert
gets to it. If he loses interest on finishing, I def plan to come back to it
someday. I'd like to complete my understanding of the paper and see a full n
java impl of this in either case. The main piece left that I don't understand
fully (computing all possible states for n), can be computed with just a brute
force check (thats how the python impl is doing it), so there may not be much
more to understand. I would like to know how the paper is getting 'i'
parametrized state generators though - thats much more efficient. The paper
shows them for n=1 and n=2.
> explore using automaton for fuzzyquery
> --------------------------------------
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
> Issue Type: Wish
> Components: Search
> Reporter: Robert Muir
> Assignee: Mark Miller
> Priority: Minor
> Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java
>
>
> Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is
> itching to write that nasty algorithm)
> we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
> * up front, calculate the maximum required K edits needed to match the users
> supplied float threshold.
> * for at least small common E up to some max K (1,2,3, etc) we should create
> a DFA for each E.
> if the required E is above our supported max, we use "dumb mode" at first (no
> seeking, no DFA, just brute force like now).
> As the pq fills, we swap progressively lower DFAs into the enum, based upon
> the lowest score in the pq.
> This should work well on avg, at high E, you will typically fill the pq very
> quickly since you will match many terms.
> This not only provides a mechanism to switch to more efficient DFAs during
> enumeration, but also to switch from "dumb mode" to "smart mode".
> i modified my wildcard benchmark to generate random fuzzy queries.
> * Pattern: 7N stands for NNNNNNN, etc.
> * AvgMS_DFA: this is the time spent creating the automaton (constructor)
> ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
> |7N|10|64.0|4155.9|38.6|20.3|
> |14N|10|0.0|2511.6|46.0|37.9|
> |28N|10|0.0|2506.3|93.0|86.6|
> |56N|10|0.0|2524.5|304.4|298.5|
> as you can see, this prototype is no good yet, because it creates the DFA in
> a slow way. right now it creates an NFA, and all this wasted time is in
> NFA->DFA conversion.
> So, for a very long string, it just gets worse and worse. This has nothing to
> do with lucene, and here you can see, the TermEnum is fast (AvgMS -
> AvgMS_DFA), there is no problem there.
> instead we should just build a DFA to begin with, maybe with this paper:
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
> we can precompute the tables with that algorithm up to some reasonable K, and
> then I think we are ok.
> the paper references using http://portal.acm.org/citation.cfm?id=135907 for
> linear minimization, if someone wants to implement this they should not worry
> about minimization.
> in fact, we need to at some point determine if AutomatonQuery should even
> minimize FSM's at all, or if it is simply enough for them to be deterministic
> with no transitions to dead states. (The only code that actually assumes
> minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a
> summation easily). we need to benchmark really complex DFAs (i.e. write a
> regex benchmark) to figure out if minimization is even helping right now.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]