The shortlisting isn't based on stop words - a score is produced to
prioritise term selections. The score uses the IDF (inverse document
frequency) of the original term and mixes in the "edit-distance" for
each of the fuzzy variations of original terms. Care is taken to ensure
that in the query produced, fuzzy variants all use the root term's IDF
(or if the root term is not in the index the average IDF of all of the
variants is used by each variant). This avoids the rarer variants
ranking more highly than the source term in query results.
Mark
bhecht wrote:
Thanks Mark for the detailed explanation.
So one more question if I may:
How is the list shortened to to include <maxNumTerms> terms only?
In your example you had 2 stop words which of course are not included in the
token stream.
But what happens if you get more than maxNumTerms terms, how are the
maxNumTerms selected from the list?
Thanks.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]