[ 
https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804001#action_12804001
 ] 

Fuad Efendi edited comment on LUCENE-2230 at 1/23/10 2:49 AM:
--------------------------------------------------------------

Minor bug fixed (with cache warm-up)...

Don't forget to disable QueryResultsCache and to enable large DocumentCache (if 
you are using SOLR); otherwise you won't see the difference. (SOLR users: there 
are some other tricks!)

With Lucene 2.9.1: 
800ms

With BKTree and Levenstein algo: 
200ms

With BKTree and http://www.catalysoft.com/articles/StrikeAMatch.html
70ms

Average timing after many hours of tests. We may consider "integer" distance 
instead of "float" for Lucene Query if we accept this algo; I tried the best to 
have it close to "float" distance.

BKTree is cached at FuzzyTermEnumNEW. It needs warm-up if index changed; 
simplest way - to recalc it at night (separate thread will do it).

Thanks,
Fuad
+1 416-993-2060(cell)


      was (Author: funtick):
    Minor bug fixed (with cache wam-up)...

Don't forget to disable QueryResultsCache and to enable large DocumentCache (if 
you are using SOLR); otherwise you won't see the difference. (SOLR users: there 
are some other tricks!)

With Lucene 2.9.1: 
800ms

With BKTree and Levenstein algo: 
200ms

With BKTree and http://www.catalysoft.com/articles/StrikeAMatch.html:
70ms

Average timing after many hours of tests. We may consider "integer" distance 
instead of "float" for Lucene Query if if accept this algo; I tried the best to 
have it close to "float" distance.

BKTree is cached at FuzzyTermEnumNEW. It needs warm-up if index changed; 
simplest way - to recalc it at night (separate thread will do it).

Thanks,
Fuad
+1 416-993-2060(cell)

  
> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2230
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2230
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.0
>         Environment: Lucene currently uses brute force full-terms scanner and 
> calculates distance for each term. New BKTree structure improves performance 
> in average 20 times when distance is 1, and 3 times when distance is 3. I 
> tested with index size several millions docs, and 250,000 terms. 
> New algo uses integer distances between objects.
>            Reporter: Fuad Efendi
>         Attachments: BKTree.java, Distance.java, DistanceImpl.java, 
> FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java
>
>   Original Estimate: 0.02h
>  Remaining Estimate: 0.02h
>
> W. Burkhard and R. Keller. Some approaches to best-match file searching, 
> CACM, 1973
> http://portal.acm.org/citation.cfm?doid=362003.362025
> I was inspired by 
> http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick 
> Johnson, Google).
> Additionally, simplified algorythm at 
> http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more 
> logically correct than Levenstein distance, and it is 3-5 times faster 
> (isolated tests).
> Big list od distance implementations:
> http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Reply via email to