Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Doug Cutting Wed, 14 Dec 2005 21:16:01 -0800

Andrzej Bialecki wrote:

I tested it on a 5 mln index.


Thanks, this is great data!

Can you please tell us a bit more about the experiments?  In particular:

 . How were scores assigned to pages?  Link analysis?  log(number of incoming links) or OPIC?

 . How were the queries generated?  From a log or randomly?
 . How many queries did you test with?
 . When results differed greatly, did they look a lot worse?

My attempt to sort a 38M page index failed with OutOfMemory.  Sigh.

For MAX_HITS=1000 the performance increase was ca. 40-fold, i.e. queries which executed in e.g. 500 ms now executed in 10-20 ms (perfRate=40). Following the intuition, performance drops as we increase MAX_HITS, until it reaches more or less the original values (perfRate=1) for MAX_HITS=300000 (for a 5 mln doc index). After that, increasing MAX_HITS actually worsens the performance (perfRate << 1) - which can be explained by the fact that the standard HitCollector doesn't collect as many documents, if they score too low.

This doesn't make sense to me. It should never be slower. We're not actually keeping track of any more hits, only stopping earlier.
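To make the "only stopping earlier" point concrete, here is a minimal, self-contained sketch. It does not use the Lucene API or the actual IndexOptimizer code; the class name, the `scanned` counter and the array-based "index" are all hypothetical. It assumes documents are numbered in order of decreasing static score, and that a negative entry means the document does not match the query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only: not Lucene's HitCollector, not Nutch code.
public class EarlyTerminationSketch {
    /** Documents examined by the last call to search(), kept for illustration. */
    static int scanned;

    /**
     * scores[doc] is the query score of doc, or a negative value if doc does
     * not match. Doc ids are assumed to be in decreasing static-score order.
     * Scanning stops once maxHits matching documents have been seen; the
     * bounded queue of the best topN hits is the same size either way.
     */
    static List<Integer> search(float[] scores, int maxHits, int topN) {
        // Min-heap on score: the weakest of the current best hits is on top.
        PriorityQueue<Integer> best =
            new PriorityQueue<>((a, b) -> Float.compare(scores[a], scores[b]));
        int hits = 0;
        scanned = 0;
        for (int doc = 0; doc < scores.length && hits < maxHits; doc++) {
            scanned++;
            if (scores[doc] < 0) {
                continue; // not a match for this query
            }
            hits++;
            best.add(doc);
            if (best.size() > topN) {
                best.poll(); // drop the lowest-scoring hit
            }
        }
        List<Integer> result = new ArrayList<>(best);
        result.sort((a, b) -> Float.compare(scores[b], scores[a]));
        return result;
    }
}
```

With a very large maxHits the loop degenerates into the ordinary full scan, so in this model the truncated search does strictly less work than the full one, never more.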

* Two-term Nutch queries result in complex Lucene BooleanQueries over many index fields, including also PhraseQueries. These fared much worse than single-term queries: actually, the topN values were very low until MAX_HITS was increased to large values, and then all of a sudden all topN-s flipped into the 80-90% ranges.

It would be interesting to try altering the generated query, to see if it is the phrases or simply multiple terms which cause problems. To do this, one could hack the query-basic plugin, or simply alter query boost parameters. This would help us figure out where the optimization is failing. Suel used multi-term queries, but not phrases, so we expect that the phrases are causing the problem, but it would be good to see for certain. We've also never tuned Nutch's phrase matching, so it's also possible that we may sometimes over-emphasize the phrase component in scores. For example, a slop of 10 might give better results and/or be more amenable to this optimization.

I also noticed that the values of topN depended strongly on the document frequency of terms in the query. For a two-term query, where both terms have average document frequency, the topN values start from ~50% for low MAX_HITS. For a two-term query where one of the terms has a very high document frequency, the topN values start from 0% for low MAX_HITS. See the spreadsheet for details.
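The thread does not spell out how topN is computed. One plausible reading, assumed here, is the percentage of the full search's top N documents that also appear in the truncated search's top N. A small sketch under that assumption (class and method names are hypothetical):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of a top-N overlap measure; the definition is
// an assumption, not taken from the spreadsheet mentioned in the thread.
public class TopNOverlapSketch {
    /**
     * Percentage of the first n doc ids in `full` that also occur among the
     * first n doc ids in `optimized`. Returns 100 when `full` is empty.
     */
    static double overlapPercent(List<Integer> full, List<Integer> optimized, int n) {
        Set<Integer> reference =
            new HashSet<>(full.subList(0, Math.min(n, full.size())));
        if (reference.isEmpty()) {
            return 100.0; // nothing to miss
        }
        int common = 0;
        for (Integer doc : optimized.subList(0, Math.min(n, optimized.size()))) {
            if (reference.contains(doc)) {
                common++;
            }
        }
        return 100.0 * common / reference.size();
    }
}
```

Under this reading, a value of 0% for low MAX_HITS means none of the true top N documents were reached before the scan stopped.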

Were these actually useful queries? For example, I would not be concerned if results differed greatly for a query like 'to be', since that's not a very useful query. Try searching for 'the the' on Google.


Thanks!

Doug
