Michael McCandless wrote:
>On the impact of search performance for large vs small mergeFactors, I
>think the jury is still out.  People should keep testing that (and
>report back!).  Certainly, for the fastest reopen time you never want
>any merging to be done :)
Here is the original exchange I referenced:

>>On Fri, Apr 10, 2009 at 3:06 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> 24 segments is bound to be quite a bit slower than an optimized index for most things

>I'd be curious just how true this really is (in general)... my guess
>is the "long tail of tiny segments" gets into the OS's IO cache (as
>long as the system stays hot) and doesn't actually hurt things much.
>
>Has anyone tested this (performance of unoptimized vs optimized
>indexes, in general) recently?  To be a fair comparison, there should
>be no deletions in the index.
>
>Mike

After reading that, I played with some sorting code I had and did a quick cheesy test or two: one segment vs 10 or 20. In that horrible test (based on the stress sort code), I don't remember seeing much of a difference. No sorting was involved. Very, very unscientific, quick and dirty.

This time I loaded up 1.3 million Wikipedia articles, gave the test 768MB of RAM, warmed the Searcher with lots of searching before each measurement, and compared 1 segment vs 5. The optimized index was 15-20% faster with the queries I was using (approx 100 queries targeted at Wikipedia). It's an odd test system: Ubuntu on a quad-core laptop with slow laptop drives and 4 GB of RAM. Still not very scientific, but better than before.
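For anyone who wants to repeat this kind of comparison outside the benchmark package, a minimal sketch of the warm-then-measure pattern might look like the following. The class name, the placeholder workloads, and the definition of "percent faster" (reduction in average time relative to the multi-segment run) are all assumptions for illustration; this is not the code behind the numbers above.

```java
public class WarmedTimingSketch {

    // Written by busyWork so the JIT cannot discard the loop as dead code.
    static volatile long sink;

    /** Runs the task untimed `warmup` times, then returns the average nanoseconds per call over `measured` timed calls. */
    public static double averageNanos(Runnable task, int warmup, int measured) {
        for (int i = 0; i < warmup; i++) {
            task.run(); // untimed: lets caches and the JIT settle first
        }
        long start = System.nanoTime();
        for (int i = 0; i < measured; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / (double) measured;
    }

    /** Percent reduction in time of the candidate relative to the baseline (assumed definition). */
    public static double percentFaster(double baselineNanos, double candidateNanos) {
        return 100.0 * (baselineNanos - candidateNanos) / baselineNanos;
    }

    /** Hypothetical stand-in for running the query set against an index. */
    static void busyWork(int iterations) {
        long acc = 0;
        for (int i = 0; i < iterations; i++) {
            acc += i % 7;
        }
        sink = acc;
    }

    public static void main(String[] args) {
        // Placeholders: in a real run each Runnable would search one index
        // (multi-segment vs optimized) with the same queries.
        Runnable multiSegment = () -> busyWork(5_000);
        Runnable singleSegment = () -> busyWork(4_000);

        double multi = averageNanos(multiSegment, 500, 5_000);
        double single = averageNanos(singleSegment, 500, 5_000);

        System.out.printf("multi=%.0f ns, single=%.0f ns, single is %.1f%% faster%n",
                multi, single, percentFaster(multi, single));
    }
}
```

Keeping the warm-up calls outside the timed region is the point of the pattern; the measured numbers will still vary from run to run, so several repetitions are worth comparing.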


Here is the benchmark I was using in various forms:

{ "Rounds"

   ResetSystemErase

   { "Populate"
       -CreateIndex
       { "MAddDocs" AddDoc > : 15000
       -CloseIndex
   }
   { "test"
       OpenReader
       { "WarmRdrDocs" Warm > : 50
       { "WarmRdr" Search > : 5000
       { "SearchSameRdr" Search > : 50000
       CloseReader
       OpenIndex
       PrintSegmentCount
       Optimize
       CloseIndex
       NewRound
   } : 2
}

RepSumByName
RepSumByPrefRound SearchSameRdr


I also did a quick profile of a 15k-document index, 1 segment vs 10 segments. I profiled each for approx 11 million calls of readVInt. The hotspot results are below.

http://myhardshadow.com/images/1seg.png
http://myhardshadow.com/images/10seg.png


Just a quick start at looking into this from over the weekend.

--
- Mark

http://www.lucidimagination.com



