No - I didn't try to populate an index with real data and run real queries
(what is "real" after all?). I know from my experience with indexes of
several million documents, where some queries return several hundred
thousand results (one query even hit 2.5M documents). This is typical in
search: users type on average 2.3 terms per query. The chances of hitting a
query with a huge result set are not that small in such cases (I'm not
saying this is the most common case though; I agree that most searches
don't process that many documents).
Regardless, this change improves performance from an algorithmic point of
view: you allocate at most numRequestedHits+1 ScoreDocs, no matter how many
documents your query processes.
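To illustrate that bound, here is a standalone sketch (java.util.PriorityQueue standing in for Lucene's PQ, and a locally declared ScoreDoc; none of this is Lucene code) that counts allocations under the reuse pattern:

```java
import java.util.PriorityQueue;

// Standalone sketch (not Lucene code): counts how many ScoreDoc objects
// the reuse pattern allocates while collecting totalDocs hits into a
// queue of size numHits. Names here are stand-ins for illustration.
public class ReuseDemo {
    static class ScoreDoc {
        int doc;
        float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    static int countAllocations(int numHits, int totalDocs) {
        PriorityQueue<ScoreDoc> hq =
            new PriorityQueue<>(numHits, (a, b) -> Float.compare(a.score, b.score));
        int allocations = 0;
        ScoreDoc sd = null; // the single reusable instance
        for (int doc = 0; doc < totalDocs; doc++) {
            float score = (float) Math.random();
            if (sd == null) {
                sd = new ScoreDoc(doc, score);
                allocations++;
            } else {
                sd.doc = doc;   // reuse instead of allocating
                sd.score = score;
            }
            if (hq.size() < numHits) {
                hq.offer(sd);
                sd = null;      // object is now owned by the queue
            } else if (score > hq.peek().score) {
                ScoreDoc evicted = hq.poll(); // reuse what overflows out
                hq.offer(sd);
                sd = evicted;
            }
            // otherwise sd is kept and reused for the next hit
        }
        return allocations;
    }

    public static void main(String[] args) {
        // At most numHits + 1 allocations, regardless of totalDocs.
        System.out.println(countAllocations(10, 1_000_000));
    }
}
```

Once the queue is full and one spare object exists, no further allocations ever happen, which is exactly the numRequestedHits+1 bound.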
On Dec 10, 2007 10:59 AM, Michael McCandless <[EMAIL PROTECTED]>
wrote:
>
> Have you done any tests on real queries to see what impact this
> improvement has in practice? Or, to measure how many ScoreDocs are
> "typically" allocated?
>
> Mike
>
> Shai Erera wrote:
>
> > I agree w.r.t. the current implementation; however, in the worst case
> > (as we tend to consider when talking about computer algorithms), it
> > will allocate a ScoreDoc per result. With the overflow reuse, it will
> > not allocate those objects, no matter what input it gets.
> > Also, notice that there is a performance hit with the current
> > implementation of TopDocCollector: it first checks whether the
> > document should be inserted, and if so, the PQ performs the same check
> > again.
> > So, if you change the current implementation to always attempt an
> > insert, you'd gain some performance improvement as well.
> >
> > On Dec 10, 2007 10:15 AM, Paul Elschot <[EMAIL PROTECTED]> wrote:
> >
> >> The current TopDocCollector only allocates a ScoreDoc when the given
> >> score causes a new ScoreDoc to be added into the queue, but it does
> >> not reuse anything that overflows out of the queue.
> >> So, reusing the overflow out of the queue should reduce object
> >> allocations, especially for indexes that tend to have better scoring
> >> docs at the end. I wouldn't expect a 30% improvement out of that,
> >> but it would help, if only to reduce occasional performance
> >> deteriorations.
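(Paul's "better scoring docs at the end" is indeed the worst case for the non-reusing pattern. A rough standalone sketch of that case, with java.util.PriorityQueue standing in for Lucene's PQ and hypothetical numbers: ascending scores make every hit displace the queue minimum, so one object is allocated per collected document.)

```java
import java.util.PriorityQueue;

// Sketch of the worst case described above: with ascending scores every
// hit beats the current queue minimum, so a collector that allocates a
// new ScoreDoc per accepted hit allocates one object per collected
// document. Names are illustrative stand-ins, not Lucene's classes.
public class WorstCaseDemo {
    static int allocationsWithoutReuse(int numHits, int totalDocs) {
        PriorityQueue<Float> hq = new PriorityQueue<>(numHits);
        int allocations = 0;
        for (int doc = 0; doc < totalDocs; doc++) {
            float score = doc; // ascending: each doc beats the minimum so far
            if (hq.size() < numHits || score >= hq.peek()) {
                allocations++; // "new ScoreDoc(doc, score)" in the old pattern
                hq.offer(score);
                if (hq.size() > numHits) {
                    hq.poll(); // the evicted object becomes garbage, not reused
                }
            }
        }
        return allocations;
    }

    public static void main(String[] args) {
        System.out.println(allocationsWithoutReuse(10, 1000));
    }
}
```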
> >>
> >> Regards,
> >> Paul Elschot
> >>
> >>
> >> On Monday 10 December 2007 08:11:50 Shai Erera wrote:
> >>> Hi
> >>>
> >>> Lucene's PQ implements two methods: put() (which assumes the PQ has
> >>> room for the object) and insert() (which checks whether the object
> >>> can be inserted, etc.). The implementation of insert() requires the
> >>> application that uses it to allocate a new object on every call.
> >>> Specifically, it cannot reuse the objects that were removed from the
> >>> PQ.
> >>> I've done some measurements on the performance that search would gain
> >>> from such reuse, through a change to TopDocCollector.
> >>>
> >>> PriorityQueue change (added insertWithOverflow method)
> >>> ------------------------------------------------------
> >>>   /**
> >>>    * insertWithOverflow() is similar to insert(), except for its
> >>>    * return value: it returns the object (if any) that was dropped
> >>>    * off the heap because it was full. This can be the given
> >>>    * parameter (in case it is smaller than the full heap's minimum,
> >>>    * and couldn't be added) or another object that was previously
> >>>    * the smallest value in the heap and now has been replaced by a
> >>>    * larger one.
> >>>    */
> >>>   public Object insertWithOverflow(Object element) {
> >>>     if (size < maxSize) {
> >>>       put(element);
> >>>       return null;
> >>>     } else if (size > 0 && !lessThan(element, top())) {
> >>>       Object ret = heap[1];
> >>>       heap[1] = element;
> >>>       adjustTop();
> >>>       return ret;
> >>>     } else {
> >>>       return element;
> >>>     }
> >>>   }
> >>> [Very similar to insert(), only it returns the object that was
> >>> kicked out of the queue, or null]
> >>>
> >>> TopDocCollector's current implementation of collect()
> >>> -----------------------------------------------------
> >>>   public void collect(int doc, float score) {
> >>>     if (score > 0.0f) {
> >>>       totalHits++;
> >>>       if (hq.size() < numHits || score >= minScore) {
> >>>         hq.insert(new ScoreDoc(doc, score));
> >>>         minScore = ((ScoreDoc)hq.top()).score; // maintain minScore
> >>>       }
> >>>     }
> >>>   }
> >>> [See how it allocates a new ScoreDoc every time this method is
> >>> called]
> >>>
> >>> TopDocCollector's new implementation of collect()
> >>> -------------------------------------------------
> >>>   public void collect(int doc, float score) {
> >>>     if (score == 0.0f) {
> >>>       return;
> >>>     }
> >>>     totalHits++;
> >>>     if (hq.size() < numHits || score >= minScore) {
> >>>       if (sd == null) {
> >>>         sd = new ScoreDoc(doc, score);
> >>>       } else {
> >>>         sd.doc = doc;
> >>>         sd.score = score;
> >>>       }
> >>>       sd = (ScoreDoc) hq.insertWithOverflow(sd);
> >>>       minScore = ((ScoreDoc)hq.top()).score; // maintain minScore
> >>>     }
> >>>   }
> >>> [sd is a class member of the collector, of type ScoreDoc]
> >>>
> >>> Now for the performance tests. I've done two tests:
> >>> 1. Calling TopDocCollector.collect() 1,000,000, 10,000,000 and
> >>> 100,000,000 times using both implementations. Here are the results:
> >>>                          1M         10M        100M
> >>> Current Collector     218 ms   1,907 ms   20,000 ms
> >>> Modified Collector    141 ms   1,297 ms   12,672 ms
> >>> As you can see, the new implementation is 30-40% faster.
> >>>
> >>> 2. I wrote some tasks in the benchmark package that make use of the
> >>> new PQ while executing a real search over an index with 1M small
> >>> documents (10 characters each). Here are the results:
> >>>
> >>> Current TopDocCollector
> >>> -----------------------
> >>> ------------> Report Sum By (any) Name (5 about 6 out of 6)
> >>> Operation        round runCnt recsPerRun     rec/s elapsedSec  avgUsedMem avgTotalMem
> >>> CreateIndex          0      1          1      10.6       0.09   2,219,112   4,389,376
> >>> AddDocs_1000000      0      1    1000000  52,075.2      19.20  34,497,176  52,689,408
> >>> CloseIndex           0      2          1       0.1      16.62  26,178,972  51,378,688
> >>> OpenIndex            0      1          1   1,000.0       0.00  48,771,304  50,067,968
> >>> --> MySearch         0      1          1       5.3       0.19  48,771,304  50,067,968
> >>>
> >>>
> >>> Modified TopDocCollector
> >>> ------------------------
> >>> ------------> Report Sum By (any) Name (5 about 6 out of 6)
> >>> Operation        round runCnt recsPerRun     rec/s elapsedSec  avgUsedMem avgTotalMem
> >>> CreateIndex          0      1          1      10.6       0.09   2,207,008   4,389,376
> >>> AddDocs_1000000      0      1    1000000  50,955.4      19.62  32,531,992  44,825,088
> >>> CloseIndex           0      2          1       0.1      16.84  57,853,148  61,929,984
> >>> OpenIndex            0      1          1   1,000.0       0.00  57,792,136  61,929,984
> >>> --> MySearch         0      1          1       7.1       0.14  57,939,856  61,929,984
> >>>
> >>> Notice the time difference in search (0.14s modified vs. 0.19s
> >>> current). Here too, a 30% improvement.
> >>>
> >>> One thing I wasn't able to show here, but which I think is pretty
> >>> clear: the new implementation causes far fewer object allocations.
> >>> Consider the typical search for 10 results over an index with 1M
> >>> documents. The current implementation will allocate 1M (!) ScoreDoc
> >>> instances, while the new one will allocate only 11 (10 for the PQ
> >>> and 1 for reuse). On heavily loaded systems, this will result in far
> >>> less work for the GC.
> >>>
> >>> I would like to suggest reflecting this new implementation in
> >>> PriorityQueue, and also modifying TopDocCollector to make use of the
> >>> new method. Several ways to do it:
> >>> 1. Add insertWithOverflow to PQ (I hate the name and I'm willing to
> >>> accept better ones), deprecate insert() (let's say in 2.3), and in
> >>> the release after that (2.4?) rename insertWithOverflow to insert().
> >>> A complex process, but it won't break existing applications' code.
> >>> 2. Change insert's signature and state that applications that move
> >>> to 2.3 (for example) need to change their code as well (actually it
> >>> won't compile).
> >>> 3. Any other alternatives?
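For option 1, the transition could look something like the sketch below: a minimal self-contained heap standing in for Lucene's PriorityQueue (the real internals are elided), with insert() kept for source compatibility but deprecated and delegating to insertWithOverflow():

```java
// Sketch of migration option 1: insert() stays so existing code keeps
// compiling, but is deprecated and delegates to insertWithOverflow().
// This is a minimal illustrative min-heap, not Lucene's PriorityQueue.
public class MinHeap {
    private final Object[] heap;
    private int size = 0;
    private final int maxSize;

    public MinHeap(int maxSize) {
        this.maxSize = maxSize;
        this.heap = new Object[maxSize + 1]; // 1-based heap array
    }

    protected boolean lessThan(Object a, Object b) {
        return (Integer) a < (Integer) b;
    }

    public Object top() { return heap[1]; }
    public int size() { return size; }

    public void put(Object element) { // assumes there is room
        heap[++size] = element;
        upHeap();
    }

    public Object insertWithOverflow(Object element) {
        if (size < maxSize) {
            put(element);
            return null;          // nothing dropped
        } else if (size > 0 && !lessThan(element, top())) {
            Object ret = heap[1]; // previous minimum, now dropped
            heap[1] = element;
            downHeap();
            return ret;
        } else {
            return element;       // too small to enter the full heap
        }
    }

    /** @deprecated use insertWithOverflow(Object) instead. */
    @Deprecated
    public boolean insert(Object element) {
        // Old contract: return true iff the element entered the queue.
        return insertWithOverflow(element) != element;
    }

    private void upHeap() {
        int i = size;
        while (i > 1 && lessThan(heap[i], heap[i / 2])) {
            Object tmp = heap[i]; heap[i] = heap[i / 2]; heap[i / 2] = tmp;
            i /= 2;
        }
    }

    private void downHeap() {
        int i = 1;
        while (true) {
            int child = 2 * i;
            if (child > size) break;
            if (child + 1 <= size && lessThan(heap[child + 1], heap[child])) child++;
            if (!lessThan(heap[child], heap[i])) break;
            Object tmp = heap[i]; heap[i] = heap[child]; heap[child] = tmp;
            i = child;
        }
    }
}
```

The reference comparison in insert() works because insertWithOverflow() returns the very object it was given only when nothing entered the queue.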
> >>>
> >>> And, of course, review the classes in the Lucene code that use PQ
> >>> and modify them as necessary.
> >>>
> >>> A very long email for a very simple (but important) performance
> >>> improvement.
> >>>
> >>> Shai
> >>>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >>
> >
> >
> > --
> > Regards,
> >
> > Shai Erera
>
>
>
>
--
Regards,
Shai Erera