David Spencer wrote:
[c] "interesting words" - uses code from MoreLikeThis to give a table of all interesting
words in the current "source" doc ordered by score.
Remember score is idf*tf as per Doug's mail (and as per my
hopefully correct understanding of these things). This page is of course more of a debugging
tool than something a normal user would see. One possible improvement that jumped out at me after reviewing this table is stemming: say, allowing more words in the generated query when two words share the same stem.
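For readers who want to see the idf*tf ranking concretely, here is a minimal sketch in plain Java. It is not the MoreLikeThis code: the class InterestingWords, the method rank(), and the docFreq map (a stand-in for index lookups) are all hypothetical, and the idf formula used, log(numDocs / (docFreq + 1)) + 1, is just one common form that I am assuming for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: ranks one document's words by tf * idf.
final class InterestingWords {

    // docFreq: word -> number of documents containing it (stand-in for an index lookup).
    // numDocs: total number of documents in the (imaginary) index.
    static Map<String, Double> rank(List<String> docWords,
                                    Map<String, Integer> docFreq,
                                    int numDocs) {
        // Term frequency: how often each word occurs in the source document.
        Map<String, Integer> tf = new HashMap<>();
        for (String w : docWords) {
            tf.merge(w, 1, Integer::sum);
        }

        List<Map.Entry<String, Double>> rows = new ArrayList<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            if (df == 0) {
                continue; // word unknown to the index, nothing to score
            }
            // Assumed idf form; rarer words get a larger weight.
            double idf = Math.log((double) numDocs / (df + 1)) + 1.0;
            rows.add(new AbstractMap.SimpleEntry<String, Double>(e.getKey(), e.getValue() * idf));
        }

        // Highest score first, as in the debugging table.
        rows.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));

        Map<String, Double> table = new LinkedHashMap<>();
        for (Map.Entry<String, Double> r : rows) {
            table.put(r.getKey(), r.getValue());
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("lucene", 2);
        docFreq.put("query", 50);
        docFreq.put("the", 1000);
        List<String> doc = Arrays.asList("lucene", "lucene", "the", "the", "the", "query");
        rank(doc, docFreq, 1000)
            .forEach((word, score) -> System.out.printf("%-8s %.3f%n", word, score));
    }
}
```

With these made-up numbers a rare word that occurs twice outranks a very common word that occurs three times, which is the behaviour the table is meant to expose.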
Actually, the analyzer should do that, shouldn't it? For example, I have stemming analyzers for a variety of languages that both stem and remove stop words - it seems silly to me to duplicate that functionality when it's so easily provided by the analyzer. Given that, I would suggest removing the stop word functionality from this class
Actually I realized this is a tricky and possibly counterintuitive issue.
In theory one might want the MoreLikeThis logic to use a *larger* stop word list than the Analyzer uses, even in the case where the Analyzer does not use any stop word list.
Reasoning is:
-- maybe you don't want Analyzer to have any stop words (so user can find the classic "to be or not to be" phrase) and the search index compression won't (in theory?) be affected by frequent stop words anyway
-- the stop words used by MoreLikeThis are a heuristic with 2 points behind them: the obvious one (stop words
are not interesting for similarity) and the fact that they're there to minimize the expensive IndexReader.docFreq() calls. So more stop words are fine: they reduce docFreq() calls and let the query generator run faster
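The second point can be sketched in a few lines of plain Java. This is only an illustration under assumptions: StopWordPrefilter and docFreqs() are hypothetical names, and the ToIntFunction passed in merely stands in for IndexReader.docFreq(); no real index is involved.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.ToIntFunction;

// Hypothetical helper, not part of Lucene: filters stop words before any docFreq lookup.
final class StopWordPrefilter {

    // docFreq stands in for the expensive index call; it is invoked at most
    // once per distinct word that is not in stopWords.
    static Map<String, Integer> docFreqs(Collection<String> words,
                                         Set<String> stopWords,
                                         ToIntFunction<String> docFreq) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String w : words) {
            if (stopWords.contains(w) || result.containsKey(w)) {
                continue; // stop word or already looked up: no index access
            }
            result.put(w, docFreq.applyAsInt(w));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("to", "be", "or", "not", "to", "be", "lucene");
        Set<String> stop = new HashSet<>(Arrays.asList("to", "be", "or", "not"));

        int[] calls = {0}; // counts simulated docFreq() calls
        ToIntFunction<String> fakeDocFreq = w -> { calls[0]++; return 1; };

        docFreqs(words, new HashSet<>(), fakeDocFreq);
        System.out.println("lookups without stop list: " + calls[0]);

        calls[0] = 0;
        docFreqs(words, stop, fakeDocFreq);
        System.out.println("lookups with stop list: " + calls[0]);
    }
}
```

The stop list here is independent of whatever the Analyzer does, so it can be larger than the Analyzer's list without changing what gets indexed.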
As an aside I sometimes use a list of ~500 English stop words from "SMART" (sorry, can't easily find the ref, though this might be close: http://citeseer.nj.nec.com/context/45797/0 ). I can contribute these if wanted.
as it is not needed and only confuses things.
Regards,
Bruce Ritchie http://www.jivesoftware.com/