Does anyone know how much stop words are supposed to affect the index size?

I ran an experiment, building the index once with stop words and once without.

The corpus is the English Wikipedia, and I indexed the title and body of the articles. I used a list of 525 stop words.

With stop words removed, the index is 227 MB.
With stop words kept, the index is 331 MB.

Thus, the index grows by about 46% in this case, which I found surprising, as I expected it to grow much less. I haven't dug into the details of the Lucene file formats, but I thought compression (fields, term vectors, sparse lists, VInts) would negate the effect of stop words to a large extent.
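
A rough sketch of why variable-length integers alone may not absorb stop words. It assumes term positions are stored as delta-encoded VInts (7 data bits per byte, high bit set when another byte follows); that layout is an assumption here, not something verified against the Lucene file formats. The function names and the sample positions are hypothetical, made up for illustration.

```python
def encode_vint(n: int) -> bytes:
    """Encode a non-negative integer 7 bits at a time, low-order group first.

    The high bit of each byte is set when another byte follows.
    """
    if n < 0:
        raise ValueError("VInt encoding expects a non-negative integer")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # 7 payload bits plus continuation flag
        n >>= 7
    out.append(n)  # final byte, continuation flag clear
    return bytes(out)


def delta_encoded_size(positions) -> int:
    """Bytes needed to store sorted positions as VInt-encoded deltas."""
    total = 0
    previous = 0
    for position in positions:
        total += len(encode_vint(position - previous))
        previous = position
    return total


# Hypothetical 1000-token document (assumed data, not measured):
# a stop word at every 10th position versus a rare term that occurs twice.
stop_word_positions = list(range(0, 1000, 10))
rare_term_positions = [17, 512]

stop_word_bytes = delta_encoded_size(stop_word_positions)
rare_term_bytes = delta_encoded_size(rare_term_positions)
```

Under these assumptions the deltas for a frequent word are small, but each occurrence still costs at least one byte, so the total scales with the number of occurrences rather than shrinking toward zero.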

Some more details + a link to my stopword list are here:
http://www.searchmorph.com/weblog/index.php?id=36

-- Dave
