The shortlisting isn't based on stop words. A score is produced for each candidate term to prioritise term selection, and the highest-scoring terms are the ones kept. The score uses the IDF (inverse document frequency) of the original term and mixes in the edit distance of each fuzzy variant of that term. Care is taken to ensure that, in the query produced, all fuzzy variants use the root term's IDF (or, if the root term is not in the index, each variant uses the average IDF of all the variants). This stops rarer variants from ranking more highly than the source term in query results.
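
To make the ranking concrete, here is a rough sketch in Python rather than Lucene's Java. It is not the actual Lucene code: the function names, the IDF formula, and the choice to combine the two factors as similarity * IDF are all assumptions made for illustration. It only shows the idea that every variant of a root term shares one IDF, so only the edit-distance similarity separates them.

```python
import heapq
import math


def idf(doc_freq, num_docs):
    # A common smoothed IDF formula; assumed here, not taken from Lucene.
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))


def shortlist(variants_by_root, doc_freqs, num_docs, max_num_terms):
    """Return the top max_num_terms (score, variant, root) tuples.

    variants_by_root maps each root term to {variant: similarity}, where
    similarity is an edit-distance-based value in (0, 1] and every variant
    is a term that exists in the index.
    """
    scored = []
    for root, variants in variants_by_root.items():
        if not variants:
            continue
        if doc_freqs.get(root, 0) > 0:
            # Root term is in the index: all variants share its IDF.
            shared_idf = idf(doc_freqs[root], num_docs)
        else:
            # Root term is missing: fall back to the variants' average IDF.
            shared_idf = sum(
                idf(doc_freqs[v], num_docs) for v in variants
            ) / len(variants)
        for variant, similarity in variants.items():
            # Assumed combination: similarity scaled by the shared IDF.
            scored.append((similarity * shared_idf, variant, root))
    # Keep only the highest-scoring terms across all root terms.
    return heapq.nlargest(max_num_terms, scored)
```

With this scheme a rare misspelling such as "smyth" can never outscore "smith" itself, because both are multiplied by the same IDF and the exact match has the higher similarity.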

Mark

bhecht wrote:
Thanks Mark for the detailed explanation.
So one more question if I may:
How is the list shortened to include <maxNumTerms> terms only?
In your example you had 2 stop words which of course are not included in the
token stream.
But what happens if you get more than maxNumTerms terms, how are the
maxNumTerms selected from the list?
Thanks.


