Does anyone know how much stop words are supposed to affect the index size?
I did an experiment of building an index once with, and once without, stop words.
The corpus is the English Wikipedia, and I indexed the title and body of the articles. I used a list of 525 stop words.
With stopwords removed the index is 227MB. With stopwords kept the index is 331MB.
Thus, the index grows by 45% in this case, which I found suprising, as I expected it to not grow as much. I haven't dug into the details of the Lucene file formats but thought compression (field/term vector/sparse lists/ vints) would negate the affect of stopwords to a large extent.
Some more details + a link to my stopword list are here: http://www.searchmorph.com/weblog/index.php?id=36
-- Dave
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]