Hi Eks Dev, thanks for the comment! I like this: > If you need too use this, nothing simpler, you do not even need pair > comparison (aka traversal), just Index terms split into bigrams and > search with standard Query.
And, I need to study all the tricks of Automaton... -Fuad > -----Original Message----- > From: Eks Dev (JIRA) [mailto:j...@apache.org] > Sent: February-11-10 5:04 AM > To: java-dev@lucene.apache.org > Subject: [jira] Commented: (LUCENE-2089) explore using automaton for > fuzzyquery > > > [ https://issues.apache.org/jira/browse/LUCENE- > 2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment- > tabpanel&focusedCommentId=12832424#action_12832424 ] > > Eks Dev commented on LUCENE-2089: > --------------------------------- > > {quote} > What about this, > http://www.catalysoft.com/articles/StrikeAMatch.html > it seems logically more appropriate to (human-entered) text objects than > Levenshtein distance, and it is (in theory) extremely fast; is DFA- > distance faster? > {quote} > > Is that only me who sees plain, vanilla bigram distance here? What is > new or better in StrikeAMatch compared to the first phase of the current > SpellCehcker (feeding PriorityQueue with candidates)? > > If you need too use this, nothing simpler, you do not even need pair > comparison (aka traversal), just Index terms split into bigrams and > search with standard Query. > > > Autmaton trick is a neat one. Imo, the only thing that would work > better is to make term dictionary real trie (ternary, n-ary, dfa, makes > no big diff). Making TerrmDict some sort of trie/dfa would permit smart > beam-search, even without compiling query DFA. Beam search also makes > implementation of better distances possible (Weighted Edit distance > without "metric constraint" ). I guess this is going to be possible with > Flex, Mike was allready talking about DFA Dictionary :) > > It took a while to figure out the trick Robert pooled here, treating > term dictionary as another DFA due to the sortedness, nice. > > > explore using automaton for fuzzyquery > > -------------------------------------- > > > > Key: LUCENE-2089 > > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > > Project: Lucene - Java > > Issue Type: Wish > > Components: Search > > Reporter: Robert Muir > > Assignee: Mark Miller > > Priority: Minor > > Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, > TestFuzzy.java > > > > > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know > he is itching to write that nasty algorithm) > > we can optimize fuzzyquery by using AutomatonTermsEnum, here is my > idea > > * up front, calculate the maximum required K edits needed to match the > users supplied float threshold. > > * for at least small common E up to some max K (1,2,3, etc) we should > create a DFA for each E. > > if the required E is above our supported max, we use "dumb mode" at > first (no seeking, no DFA, just brute force like now). > > As the pq fills, we swap progressively lower DFAs into the enum, based > upon the lowest score in the pq. > > This should work well on avg, at high E, you will typically fill the > pq very quickly since you will match many terms. > > This not only provides a mechanism to switch to more efficient DFAs > during enumeration, but also to switch from "dumb mode" to "smart mode". > > i modified my wildcard benchmark to generate random fuzzy queries. > > * Pattern: 7N stands for NNNNNNN, etc. > > * AvgMS_DFA: this is the time spent creating the automaton > (constructor) > > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > > |7N|10|64.0|4155.9|38.6|20.3| > > |14N|10|0.0|2511.6|46.0|37.9| > > |28N|10|0.0|2506.3|93.0|86.6| > > |56N|10|0.0|2524.5|304.4|298.5| > > as you can see, this prototype is no good yet, because it creates the > DFA in a slow way. right now it creates an NFA, and all this wasted time > is in NFA->DFA conversion. > > So, for a very long string, it just gets worse and worse. This has > nothing to do with lucene, and here you can see, the TermEnum is fast > (AvgMS - AvgMS_DFA), there is no problem there. > > instead we should just build a DFA to begin with, maybe with this > paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > > we can precompute the tables with that algorithm up to some reasonable > K, and then I think we are ok. > > the paper references using > http://portal.acm.org/citation.cfm?id=135907 for linear minimization, if > someone wants to implement this they should not worry about > minimization. > > in fact, we need to at some point determine if AutomatonQuery should > even minimize FSM's at all, or if it is simply enough for them to be > deterministic with no transitions to dead states. (The only code that > actually assumes minimal DFA is the "Dumb" vs "Smart" heuristic and this > can be rewritten as a summation easily). we need to benchmark really > complex DFAs (i.e. write a regex benchmark) to figure out if > minimization is even helping right now. > > -- > This message is automatically generated by JIRA. > - > You can reply to this email to add a comment to the issue online. > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org