Could one get the best of both worlds by not heapifying the PQ
until it is full?

Regards,
Paul Elschot

Op zondag 06 december 2009 00:01:49 schreef Grant Ingersoll:
> 
> On Dec 5, 2009, at 10:47 PM, Earwin Burrfoot wrote:
> 
> > If someone needs all results, they know it beforehand. Why can't they
> > write this collector themselves? It's trivial, just like you said.
> 
> I'm not following your comment.  Of course they can write it.  But that's 
> true for all the implementations we provide.
> 
> However, the Collector stuff doesn't handle the post collection sort, so, it 
> would require someone to hack some low level Lucene internals.  Also, I think 
> it's interesting to think about the case of getting most of the results, but 
> maybe not all.
> 
> > 
> > On Sun, Dec 6, 2009 at 01:22, Grant Ingersoll <gsing...@apache.org> wrote:
> >> At ScaleCamp yesterday in the UK, I was listening to a talk on Xapian and 
> >> the speaker said one of the optimizations they do when retrieving a large 
> >> result set is that instead of managing a Priority Queue, they just 
> >> allocate a large array to hold all of the results and then sort afterward. 
> >>   Seemed like a good idea since you could avoid a whole slew of PQ 
> >> operations at the cost of (possibly) some extra memory, so I thought I 
> >> would see if anyone has looked at doing something similar here.  (Xapian 
> >> is C++, I believe, so it likely has different perf. characteristics which 
> >> may or may not matter).
> >> 
> >> A few things come to mind:
> >> 1. In many cases, when someone is asking for a lot of results, my guess is 
> >> they want all the results.  You often see this in people who are doing 
> >> significant post processing on large result sets (often for machine 
> >> learning)
> >> 2. My gut says there is actually some heuristic to be applied here, 
> >> whereby if the requested number of results is some percentage of the total 
> >> num. of hits, then do the whole results + sort, otherwise, use PQ.  Thus, 
> >> this maybe would be useful even when doing something like, say, 500 or a 
> >> 1000 docs, as opposed to the bigger cases I'm thinking about.  Still, 
> >> don't need to prematurely optimize.
> >> 3.  I think we could almost implement this entirely w/ a Collector, except 
> >> for the sort step, which presumably could be a callback (empty 
> >> implementation in the base class).
> >> 4. When doing this, you don't bother to truncate the list when n < 
> >> totalHits (although see below).
> >> 
> >> Perhaps, we could also even save having a big array for doc ids and 
> >> instead just flip bits in a bit set (would still need the array for 
> >> scores) and then materialize the bit set to an array at the end when the 
> >> num requested is less than the total number of hits.
> >> 
> >> Anyone have thoughts on this?  Seems fairly trivial to crank out a 
> >> Collector to do it, minus the post processing step, which would be 
> >> relatively trivial to add.
> >> 
> >> -Grant
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >> 
> >> 
> > 
> > 
> > 
> > -- 
> > Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
> > Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> > ICQ: 104465785
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-dev-h...@lucene.apache.org
> > 
> 
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
> 
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using 
> Solr/Lucene:
> http://www.lucidimagination.com/search
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> 
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Reply via email to