David Spencer wrote:
> Does anyone know how much stop words are supposed to affect the index size?
>
> I did an experiment, building an index once with and once without stop words.
>
> The corpus is the English Wikipedia, and I indexed the title and body of the articles. I used a list of 525 stop words.
>
> With stop words removed, the index is 227MB.
> With stop words kept, the index is 331MB.

The unstopped version is indeed bigger and slower to build, but it's only slower to search when folks search on stop words. One approach to minimizing stop words in searches (used by, e.g., Nutch and Google) is to index all stop words but remove them from queries unless they're (a) in a phrase or (b) explicitly required with a "+". (It might be nice if Lucene included a query parser with this feature.)
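For concreteness, here is a minimal sketch of that query-rewriting rule in plain Java. The class and method names and the toy whitespace tokenization are mine, not Nutch's or Lucene's API, and the stop list is a toy one; the point is just that bare stop words get dropped while tokens inside a quoted phrase or marked with "+" survive.

import java.util.*;

public class StopWordQueryFilter {

    // Toy stop list for the example; a real one would be much longer
    // (e.g. the 525-entry list mentioned above).
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "to", "be", "or", "not"));

    /**
     * Drops stop words from a query string unless they are
     * (a) inside a quoted phrase or (b) explicitly required with "+".
     */
    public static String filterStopWords(String query) {
        List<String> kept = new ArrayList<>();
        boolean inPhrase = false;
        for (String token : query.split("\\s+")) {
            if (token.isEmpty()) continue;
            // Track whether we are inside a quoted phrase; quotes may
            // open and close on the same token.
            boolean opens = token.startsWith("\"");
            boolean closes = token.endsWith("\"") && token.length() > 1;
            boolean insidePhrase = inPhrase || opens;
            if (opens && !closes) inPhrase = true;
            if (closes) inPhrase = false;

            boolean required = token.startsWith("+");
            String bare = token.replaceAll("[\"+]", "").toLowerCase();

            if (insidePhrase || required || !STOP_WORDS.contains(bare)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Bare stop words are dropped...
        System.out.println(filterStopWords("the lucene index size"));
        // -> lucene index size

        // ...but kept inside phrases or when required with "+".
        System.out.println(filterStopWords("\"to be or not to be\" +the hamlet"));
        // -> "to be or not to be" +the hamlet
    }
}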


Nutch also optimizes phrase searches involving a few very common stop words (e.g., "the", "a", "to") by indexing these as bigrams and converting phrases that contain them to bigram phrases. So a search for "to be or not to be" turns into a search for "to-be be or not-to to-be", which is considerably faster since it involves rarer terms. But the more words you bigram, the bigger the index gets and the slower updates get, so you probably can't afford to do this for your full stop list. (It might be nice if Lucene included support for this technique too!)
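Here is a rough sketch of that phrase conversion, again in plain Java rather than Nutch's actual code. The emission rules below are my reconstruction from the example above and may differ from Nutch's in detail: every adjacent pair containing a common word becomes one joined term, and the final word is kept only when no bigram has already absorbed it. On "to be or not to be" with the common list {the, a, to}, it produces exactly the rewritten phrase shown above.

import java.util.*;

public class BigramPhrases {

    // Toy list of very common words chosen for bigramming; this is
    // deliberately much shorter than a full stop list.
    private static final Set<String> COMMON =
        new HashSet<>(Arrays.asList("the", "a", "to"));

    /**
     * Rewrites a phrase so that every adjacent pair containing a
     * common word becomes one joined bigram term; the phrase search
     * then only touches comparatively rare terms.
     */
    public static List<String> toBigramPhrase(String[] words) {
        List<String> out = new ArrayList<>();
        if (words.length == 0) return out;
        for (int i = 0; i < words.length - 1; i++) {
            if (COMMON.contains(words[i]) || COMMON.contains(words[i + 1])) {
                out.add(words[i] + "-" + words[i + 1]);
            } else {
                out.add(words[i]);
            }
        }
        // Keep the final word only when no bigram already absorbed it.
        int n = words.length;
        if (n == 1 || (!COMMON.contains(words[n - 1]) && !COMMON.contains(words[n - 2]))) {
            out.add(words[n - 1]);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] phrase = "to be or not to be".split(" ");
        System.out.println(String.join(" ", toBigramPhrase(phrase)));
        // -> to-be be or not-to to-be
    }
}

Note that this is only the query side of the trick; the index-time analyzer would have to emit the same joined terms (alongside the ordinary unigrams) for the bigram phrase to match.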

Doug
