Fuzzy search tends to be super heavy on CPU because of the Levenstein
distance algo. We use it for a small index 60MB for spell correcting and our
QPS suffers as a result.

There was recently a discussion of a new fuzzy algorithm:
https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661286#action_12661286

M

On Wed, Feb 18, 2009 at 5:59 AM, Varun Dhussa <va...@mapmyindia.com> wrote:

> The method suggested would make the speed faster, but I doubt whether it
> would be substantial on processors with slower clock speed. Keeping in mind
> that most processors are going multi-core, it would make sense to
> multi-thread the scan.
>
> Any remarks are welcome!
>
> Varun Dhussa
> Product Architect
> CE InfoSystems (P) Ltd
> http://www.mapmyindia.com
>
>
>
> mark harwood wrote:
>
>> I was having some thoughts recently about speeding up fuzzy search.
>>
>> The current system does edit-distance on all terms A-Z, single threaded.
>> Prefix length can reduce the search space and there is a "minimum
>> similarity" threshold but that's roughly where we are. Multithreading this
>> to make use of multiple CPUs is one option to look at but I was mainly
>> thinking about smarter ways to do the fuzzy scan:
>>
>> I had the notion that we could move to a solution where a priority queue
>> keeps the "best matches so far" and as you progress through the termEnum you
>> could bail out of edit distance calculations quickly using a rough(cheap)
>> assessment of if the current term is likely to make the cut (i.e. beat the
>> current lowest score in the priority queue). It would make sense to populate
>> the priority queue ASAP with terms that are most likely to be the best
>> matches and these will be the ones that share a reasonable length prefix.
>> As an example - searching for Obama~
>>
>> 1) Create "best matches" priority queue
>> 2) Scan all terms from oba to obz populating priority queue
>> 3) Scan all terms from "a" to "oba" and "obz" to "z", exiting quickly if
>> the term fails to meet lowest score in the priority queue.
>>
>> How we "exit quickly" and how we determine what prefix to use in 2) are to
>> be determined but the principle seems reasonable
>>
>> Thoughts?
>>
>>
>>
>>
>> ----- Original Message ----
>> From: Varun Dhussa <va...@mapmyindia.com>
>> To: java-user@lucene.apache.org
>> Sent: Wednesday, 18 February, 2009 10:36:07
>> Subject: Lucene search performance on Sun UltraSparc T2 (T5120) servers
>>
>> Hi,
>>
>> I have had a bad experience when migrating my application from Intel Xeon
>> based servers to Sun UltraSparc T2 T5120 servers. Lucene fuzzy search just
>> does not perform. A search which took approximately 500 ms takes more than 6
>> seconds to execute.
>>
>> The index has about 100,000,000 records. So, I tried to split it into 10
>> indices and used the ParallelSearcher on it, but still got similar results.
>>
>> I am guessing that this is because the distance implementation used by
>> Lucene requires higher clock speed and can't be parallelized much.
>>
>> Please advice
>>
>> -- Varun Dhussa
>> Product Architect
>> CE InfoSystems (P) Ltd
>> http://www.mapmyindia.com
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

Reply via email to