[
https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832096#action_12832096
]
Fuad Efendi edited comment on LUCENE-2230 at 2/10/10 6:22 PM:
--------------------------------------------------------------
Hi Uwe,
I am trying to study LUCENE-2258 right now...
bq. BKTree contains terms no longer available
BKTree contains objects, not terms; in my sample it contains Strings, new
BKTree<String>(new Distance()). It is a structure for fast lookup of close
objects from a set of objects, with predefined distance algorithm.
It won't hurt if String appears in BKTree structure, and corresponding Term
disappeared from Index; search results will be the same. Simply, search for
<DisappearedTerm> OR <AnotherTerm> is the same as search for <AnotherTerm>.
At least, we can run background thread which will create new BKTree instance,
without hurting end users.
Yes, Term<->String is another thing to do... I recreate fake terms in
TermEnum...
BKTree allows to iterate about 5-10% of whole structure in order to find
closest matches only if distance threshold is small, 2. If it is 4, almost no
any improvement. And, classic Levenshtein distance is slow...
Edited: trying to study LUCENE-2089...
was (Author: funtick):
Hi Uwe,
I am trying to study LUCENE-2258 right now...
bq. BKTree contains terms no longer available
BKTree contains objects, not terms; in my sample it contains Strings, new
BKTree<String>(new Distance()). It is a structure for fast lookup of close
objects from a set of objects, with predefined distance algorithm.
It won't hurt if String appears in BKTree structure, and corresponding Term
disappeared from Index; search results will be the same. Simply, search for
<DisappearedTerm> OR <AnotherTerm> is the same as search for <AnotherTerm>.
At least, we can run background thread which will create new BKTree instance,
without hurting end users.
Yes, Term<->String is another thing to do... I recreate fake terms in
TermEnum...
BKTree allows to iterate about 5-10% of whole structure in order to find
closest matches only if distance threshold is small, 2. If it is 4, almost no
any improvement. And, classic Levenshtein distance is slow...
> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> ----------------------------------------------------------------
>
> Key: LUCENE-2230
> URL: https://issues.apache.org/jira/browse/LUCENE-2230
> Project: Lucene - Java
> Issue Type: Improvement
> Affects Versions: 3.0
> Environment: Lucene currently uses brute force full-terms scanner and
> calculates distance for each term. New BKTree structure improves performance
> in average 20 times when distance is 1, and 3 times when distance is 3. I
> tested with index size several millions docs, and 250,000 terms.
> New algo uses integer distances between objects.
> Reporter: Fuad Efendi
> Attachments: BKTree.java, Distance.java, DistanceImpl.java,
> FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java
>
> Original Estimate: 0.02h
> Remaining Estimate: 0.02h
>
> W. Burkhard and R. Keller. Some approaches to best-match file searching,
> CACM, 1973
> http://portal.acm.org/citation.cfm?doid=362003.362025
> I was inspired by
> http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick
> Johnson, Google).
> Additionally, simplified algorythm at
> http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more
> logically correct than Levenstein distance, and it is 3-5 times faster
> (isolated tests).
> Big list od distance implementations:
> http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]