On Dec 18, 2007 2:38 AM, Mark Miller <[EMAIL PROTECTED]> wrote:
> For the data that I normally work with (short articles), I found that
> the sweet spot was around 80-120. I actually saw a slight decrease going
> above that...not sure if that held forever though. That was testing on
> an earlier release (I think 2.1?). However, if you want to test
> searching it would seem that you are going to want to optimize the
> index. I have always found that whatever I save by changing the merge
> factor is paid back when you optimize. I have not "scientifically"
> tested this, but found it to be the case in every speed test I ran. This
> is an interesting thing to me for this test. Do you test with a full
> optimize for indexing? If you don't, can you really test the search
> performance with the advantage of a full optimize? So, if you are going
> to optimize, why mess with the merge factor? It may still play a small
> role, but at best I think it's a pretty weak lever.

I had a similar experience: I set the merge factor to ~maxint and
optimized at the end, and it "felt" like it was the same (I never measured
it, though). In fact, with the new concurrent merges, I think it should be
faster to merge on the fly?

(One comment: it is important to set the merge factor back to a reasonable
number before the final optimize; otherwise you hit OutOfMem because so
many segments are merged at once.)

> - Mark
>
> Grant Ingersoll wrote:
> > I did hear back from the authors. Some of the issues were based on
> > values chosen for mergeFactor (10,000) I think, but there also seemed
> > to be some questions about parsing the TREC collection. It was split
> > out into individual files, as opposed to trying to stream in the
> > documents like we do with Wikipedia, so I/O overhead may be an issue.
> > At the time, 1.9.1 did not have much TREC support, so splitting files
> > is probably the easiest way to do it. Their indexing code was based
> > off the demo and some LIA reading.
> >
> > They thought they would try Lucene again when 2.3 comes out. From our
> > end, I think we need to improve the docs around mergeFactor. We
> > generally just say bigger is better, but my understanding is there is
> > definitely a limit to this (100?? Maybe 1000) so we should probably
> > suggest that in the docs. And, of course, I think the new
> > contrib/benchmark has support for reading TREC (although I don't know
> > if it handles streaming it) such that I think it shouldn't be a
> > problem this time around.
>

Yes, it does streaming: TREC compressed files are read with
GZIPInputStream "on demand". The next doc's text is read/parsed only when
the indexer requests it and the indexable document is created; no doc
files are created on disk.

> >
> > At any rate, I think we are for the most part doing the right things.
> > Anyone have any thoughts on advice about an upper bound for mergeFactor?
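
As a rough way to think about that upper bound, here is a toy cost model in
Java. It is only a sketch and not Lucene code: the class MergeFactorModel,
its simulate() method, and the "one unit per flushed segment" accounting are
all assumptions made up for illustration. It assumes a simple level-based
scheme in which mergeFactor equal-sized segments are merged into one, and it
counts how much data is rewritten while indexing versus by one final
optimize.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy cost model for a level-based merge scheme. Hypothetical; NOT Lucene code. */
final class MergeFactorModel {

    /** Outcome of one simulated run, measured in "flushed segment" units. */
    static final class Result {
        final long copiedWhileIndexing;   // units rewritten by merges during indexing
        final int segmentsBeforeOptimize; // segments left when indexing stops
        final long copiedByOptimize;      // units rewritten by one final full merge

        Result(long copiedWhileIndexing, int segmentsBeforeOptimize, long copiedByOptimize) {
            this.copiedWhileIndexing = copiedWhileIndexing;
            this.segmentsBeforeOptimize = segmentsBeforeOptimize;
            this.copiedByOptimize = copiedByOptimize;
        }
    }

    /** Simulates `flushes` unit-sized segments being written with the given merge factor. */
    static Result simulate(int flushes, int mergeFactor) {
        if (flushes < 0 || mergeFactor < 2) {
            throw new IllegalArgumentException("need flushes >= 0 and mergeFactor >= 2");
        }
        // perLevel.get(k) = number of segments that each hold mergeFactor^k units.
        List<Integer> perLevel = new ArrayList<>();
        long copied = 0;
        for (int i = 0; i < flushes; i++) {
            int level = 0;
            long size = 1;
            bump(perLevel, level);
            // Cascade: whenever a level fills up, merge it into one segment one level up.
            while (perLevel.get(level) == mergeFactor) {
                perLevel.set(level, 0);
                size *= mergeFactor;   // size of the newly merged segment
                copied += size;        // every unit in it is rewritten once
                level++;
                bump(perLevel, level);
            }
        }
        int segments = 0;
        for (int n : perLevel) {
            segments += n;
        }
        // A final optimize rewrites everything, unless one segment already remains.
        long optimize = segments > 1 ? flushes : 0;
        return new Result(copied, segments, optimize);
    }

    private static void bump(List<Integer> perLevel, int level) {
        if (level == perLevel.size()) {
            perLevel.add(0);
        }
        perLevel.set(level, perLevel.get(level) + 1);
    }

    public static void main(String[] args) {
        for (int mf : new int[] {10, 100, 1000, Integer.MAX_VALUE}) {
            Result r = simulate(100_000, mf);
            System.out.println("mergeFactor=" + mf
                    + " copiedWhileIndexing=" + r.copiedWhileIndexing
                    + " segmentsBeforeOptimize=" + r.segmentsBeforeOptimize
                    + " copiedByOptimize=" + r.copiedByOptimize);
        }
    }
}
```

In this model, 1000 flushes with a factor of 10 rewrite 3000 units while
indexing and leave a single segment, while a factor of 100 rewrites 1000
units but leaves 10 segments for the optimize to rewrite. With a factor near
maxint nothing is merged on the fly and the optimize faces all 1000 segments
at once, which is the situation behind the OutOfMem comment above. The model
ignores memory, file handles, concurrent merges, and search speed, so it only
illustrates the shape of the trade-off and says nothing about real timings.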
> >
> > Cheers,
> > Grant
> >
> >
> > On Dec 10, 2007, at 2:54 PM, Mike Klaas wrote:
> >
> >> On 8-Dec-07, at 10:04 PM, Doron Cohen wrote:
> >>
> >>>> +1 I have been thinking about this too. Solr clearly demonstrates
> >>>> the benefits of this kind of approach, although even it doesn't make
> >>>> it seamless for users in the sense that they still need to divvy up
> >>>> the docs on the app side.
> >>>
> >>> Would be nice if this layer also took care of searchers/readers
> >>> refreshing & warming.
> >>
> >> Solr has well-tested code that provides all this functionality and
> >> more (except for automatically spawning extra indexing threads, which
> >> I agree would be a useful addition). It does heavily depend on 1.5's
> >> java.util.concurrent package, though. Many people seem to like using
> >> Solr as an embedded library layer on top of Lucene to do it all
> >> in-process, as well.
> >>
> >> -Mike
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >
> > --------------------------
> > Grant Ingersoll
> > http://lucene.grantingersoll.com
> >
> > Lucene Helpful Hints:
> > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > http://wiki.apache.org/lucene-java/LuceneFAQ
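
On the streaming point earlier in this message, the "on demand" reading can
be sketched in plain Java roughly as follows. This is not the
contrib/benchmark code: the class LazyTrecDocs, its gzip() helper, and the
simplified parsing (each document sits between a <DOC> line and a </DOC>
line) are assumptions for illustration only. The point it shows is that the
compressed stream is only advanced when the caller asks for the next
document, and nothing is written to disk.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical lazy reader: yields one <DOC>...</DOC> body per call to next(). */
final class LazyTrecDocs implements Iterator<String>, AutoCloseable {
    private final BufferedReader reader;
    private String pending; // one document read ahead by hasNext(), or null

    LazyTrecDocs(InputStream gzipped) {
        this.reader = open(gzipped);
    }

    private static BufferedReader open(InputStream gzipped) {
        try {
            return new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(gzipped), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        if (pending == null) {
            pending = readOne(); // decompresses only as far as the next </DOC>
        }
        return pending != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String doc = pending;
        pending = null;
        return doc;
    }

    /** Reads lines until one complete document is seen; returns null at end of stream. */
    private String readOne() {
        try {
            StringBuilder body = null;
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.equals("<DOC>")) {
                    body = new StringBuilder();
                } else if (trimmed.equals("</DOC>") && body != null) {
                    return body.toString();
                } else if (body != null) {
                    body.append(line).append('\n');
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            reader.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: gzip-compresses a string in memory. */
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = gzip("<DOC>\nfirst article\n</DOC>\n<DOC>\nsecond article\n</DOC>\n");
        try (LazyTrecDocs docs = new LazyTrecDocs(new ByteArrayInputStream(data))) {
            while (docs.hasNext()) {
                System.out.print(docs.next()); // an indexer would build its document here
            }
        }
    }
}
```

An indexing loop would call next() once per document it is ready to add, so
I/O and parsing are paid per request instead of up front for thousands of
small files.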