On Dec 18, 2007 2:38 AM, Mark Miller <[EMAIL PROTECTED]> wrote:

> For the data that I normally work with (short articles), I found that
> the sweet spot was around 80-120. I actually saw a slight decrease going
> above that...not sure if that held forever though. That was testing on
> an earlier release  (I think 2.1?). However, if you want to test
> searching it would seem that you are going to want to optimize the
> index. I have always found that whatever I save by changing the merge
> factor is paid back when you optimize. I have not "scientifically"
> tested this, but found it to be the case in every speed test I ran. This
> is an interesting thing to me for this test. Do you test with a full
> optimize for indexing? If you don't, can you really test the search
> performance with the advantage of a full optimize? So, if you are going
> to optimize, why mess with the merge factor? It may still play a small
> role, but at best I think it's a pretty weak lever.


I had a similar experience: I set the merge factor to ~maxint and optimized
at the end, and it "felt" about the same (I never measured it, though).
In fact, with the new concurrent merges, I think merging on the fly
should be faster?

(One comment: it is important to set the merge factor back to a reasonable
number before the final optimize; otherwise you hit an OutOfMemoryError
because so many segments are merged at once.)
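
To make the trade-off above concrete, here is a toy model in Java of
logarithmic merging. It is only a sketch and not Lucene's actual merge
policy. It assumes that every flush produces an equal-sized segment, that
mergeFactor same-level segments are merged as soon as they exist, and that a
final optimize merges at most mergeFactor segments in one go (the last point
is my reading of the comment above). The names MergeFactorModel, simulate and
segmentsPerOptimizeMerge are made up for illustration. The model counts
document copies and segments only; it says nothing about I/O, memory or
concurrent merges.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of logarithmic segment merging. NOT Lucene's real merge policy. */
final class MergeFactorModel {

    /** Outcome of one simulated indexing run (before any optimize). */
    static final class Result {
        final long docsCopiedByMerges; // docs rewritten by merges while indexing
        final int segmentsAtEnd;       // segments left for a final optimize

        Result(long docsCopiedByMerges, int segmentsAtEnd) {
            this.docsCopiedByMerges = docsCopiedByMerges;
            this.segmentsAtEnd = segmentsAtEnd;
        }
    }

    /** Simulates `flushes` equal-sized flushes of `docsPerFlush` docs each. */
    static Result simulate(int flushes, int docsPerFlush, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        List<Integer> counts = new ArrayList<>(); // counts.get(k): segments at level k
        List<Long> sizes = new ArrayList<>();     // sizes.get(k): docs per segment at level k
        counts.add(0);
        sizes.add((long) docsPerFlush);
        long copied = 0;
        for (int f = 0; f < flushes; f++) {
            counts.set(0, counts.get(0) + 1);
            // Cascade: a full level is merged into one segment of the next level.
            for (int k = 0; counts.get(k) == mergeFactor; k++) {
                long mergedSize = sizes.get(k) * mergeFactor;
                copied += mergedSize;
                counts.set(k, 0);
                if (k + 1 == counts.size()) {
                    counts.add(0);
                    sizes.add(mergedSize);
                }
                counts.set(k + 1, counts.get(k + 1) + 1);
            }
        }
        int segments = 0;
        for (int c : counts) {
            segments += c;
        }
        return new Result(copied, segments);
    }

    /** Modeling assumption: optimize merges at most mergeFactor segments at once. */
    static int segmentsPerOptimizeMerge(int segmentsAtEnd, int mergeFactor) {
        return Math.min(segmentsAtEnd, mergeFactor);
    }

    public static void main(String[] args) {
        for (int mf : new int[] {10, 100, 10000}) {
            Result r = simulate(1000, 1, mf);
            System.out.println("mergeFactor=" + mf
                    + " copiedWhileIndexing=" + r.docsCopiedByMerges
                    + " segmentsAtEnd=" + r.segmentsAtEnd
                    + " mergedAtOnceInOptimize="
                    + segmentsPerOptimizeMerge(r.segmentsAtEnd, mf));
        }
    }
}
```

In this model, 1000 one-doc flushes with mergeFactor=10 end in a single
segment, while mergeFactor=10000 never merges and leaves all 1000 segments
for the optimize to combine at once. Lowering the merge factor before the
optimize caps that number.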


> - Mark
>
> Grant Ingersoll wrote:
> > I did hear back from the authors.  Some of the issues were based on
> > values chosen for mergeFactor (10,000) I think, but there also seemed
> > to be some questions about parsing the TREC collection.  It was split
> > out into individual files, as opposed to trying to stream in the
> > documents like we do with Wikipedia, so I/O overhead may be an issue.
> > At the time, 1.9.1 did not have much TREC support, so splitting files
> > is probably the easiest way to do it.  Their indexing code was based
> > off the demo and some LIA reading.
> >
> > They thought they would try Lucene again when 2.3 comes out.  From our
> > end, I think we need to improve the docs around mergeFactor.  We
> > generally just say bigger is better, but my understanding is there is
> > definitely a limit to this (100??  Maybe 1000) so we should probably
> > suggest that in the docs.  And, of course, I think the new
> > contrib/benchmark has support for reading TREC (although I don't know
> > if it handles streaming it) such that I think it shouldn't be a
> > problem this time around.
>

Yes, it does streaming: TREC compressed files are read with GZIPInputStream
"on demand". The next doc's text is read and parsed only when the indexer
requests it, and the indexable document is created from it directly; no doc
files are created on disk.
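
A minimal sketch of this kind of on-demand reading, using only the JDK. It
is not the contrib/benchmark code: the class name GzipTrecDocStream and the
gzip helper are hypothetical, and it assumes that each document is delimited
by <DOC> and </DOC> lines and that the text is UTF-8. Documents are parsed
one at a time as the caller asks for them, and nothing is unpacked to disk.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Lazily yields document bodies from a gzip-compressed, TREC-style stream. */
final class GzipTrecDocStream implements Iterator<String>, AutoCloseable {
    private final BufferedReader reader;
    private String lookahead; // next doc body, read only when asked for

    GzipTrecDocStream(InputStream gzipped) throws IOException {
        // Assumption: UTF-8 text; real collections may use another charset.
        this.reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8));
    }

    /** Reads up to the next closing tag; returns null at end of stream. */
    private String readDoc() throws IOException {
        StringBuilder body = null;
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (body == null) {
                if (trimmed.equals("<DOC>")) {
                    body = new StringBuilder(); // text outside docs is skipped
                }
            } else if (trimmed.equals("</DOC>")) {
                return body.toString();
            } else {
                body.append(line).append('\n');
            }
        }
        return null; // an unterminated trailing doc is dropped
    }

    @Override
    public boolean hasNext() {
        if (lookahead == null) {
            try {
                lookahead = readDoc();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return lookahead != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String doc = lookahead;
        lookahead = null;
        return doc;
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }

    /** Demo helper: gzip a string in memory so the example needs no files. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = gzip("<DOC>\nfirst doc\n</DOC>\n<DOC>\nsecond doc\n</DOC>\n");
        try (GzipTrecDocStream docs = new GzipTrecDocStream(new ByteArrayInputStream(data))) {
            while (docs.hasNext()) {
                System.out.print(docs.next()); // an indexer would build its document here
            }
        }
    }
}
```

An indexer would call next() once per document it is ready to add, so the
compressed file is consumed at the indexer's pace.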


> >
> > At any rate, I think we are for the most part doing the right things.
> > Anyone have any thoughts on advice about an upper bound for mergeFactor?
> >
> > Cheers,
> > Grant
> >
> >
> > On Dec 10, 2007, at 2:54 PM, Mike Klaas wrote:
> >
> >> On 8-Dec-07, at 10:04 PM, Doron Cohen wrote:
> >>
> >>>> +1  I have been thinking about this too.  Solr clearly demonstrates
> >>>> the benefits of this kind of approach, although even it doesn't make
> >>>> it seamless for users in the sense that they still need to divvy up
> >>>> the docs on the app side.
> >>>
> >>> Would be nice if this layer also took care of searchers/readers
> >>> refreshing & warming.
> >>
> >> Solr has well-tested code that provides all this functionality and
> >> more (except for automatically spawning extra indexing threads, which
> >> I agree would be a useful addition).  It does heavily depend on 1.5's
> >> java.util.concurrent package, though.  Many people seem to like using
> >> Solr as an embedded library layer on top of Lucene to do it all
> >> in-process, as well.
> >>
> >> -Mike
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >
> > --------------------------
> > Grant Ingersoll
> > http://lucene.grantingersoll.com
> >
> > Lucene Helpful Hints:
> > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > http://wiki.apache.org/lucene-java/LuceneFAQ
> >
>
