I think a hybrid approach may be worth exploring as well. It'd let you trade off how much transient RAM you're willing to spend...
I.e., rather than insisting on heapifying after every insertion, allow many insertions to arrive and then use a selection/partition algorithm to periodically prune. So if you want 10K results, but you're willing to tolerate, say, 2X the RAM usage, you could e.g. allow the queue to fill until 20K. When you hit that 20K mark, you use a selection algorithm (e.g. quickselect) to find a more accurate bottom, allowing you to prune away the 10K results that are not in fact competitive. This also defines your new "bottom". Any new incoming results that beat the current bottom you accumulate, up until 20K again, at which point you repeat. Then only at the end do you sort the results -- and that sort would probably be optional, i.e. perhaps you simply want the top 10K hits but don't need them sorted, if you'll do some other post-processing.

Mike

On Sat, Dec 5, 2009 at 6:34 PM, Paul Elschot <paul.elsc...@xs4all.nl> wrote:
> Could one get the best of both worlds by not heapifying the PQ
> until it is full?
>
> Regards,
> Paul Elschot
>
> On Sunday, 06 December 2009, 00:01:49, Grant Ingersoll wrote:
>>
>> On Dec 5, 2009, at 10:47 PM, Earwin Burrfoot wrote:
>>
>> > If someone needs all results, they know it beforehand. Why can't they
>> > write this collector themselves? It's trivial, just like you said.
>>
>> I'm not following your comment. Of course they can write it. But that's
>> true for all the implementations we provide.
>>
>> However, the Collector stuff doesn't handle the post-collection sort, so it
>> would require someone to hack some low-level Lucene internals. Also, I
>> think it's interesting to think about the case of getting most of the
>> results, but maybe not all.
>>
>> >
>> > On Sun, Dec 6, 2009 at 01:22, Grant Ingersoll <gsing...@apache.org> wrote:
>> >> At ScaleCamp yesterday in the UK, I was listening to a talk on Xapian and
>> >> the speaker said one of the optimizations they do when retrieving a large
>> >> result set is that instead of managing a Priority Queue, they just
>> >> allocate a large array to hold all of the results and then sort
>> >> afterward. Seemed like a good idea, since you could avoid a whole slew
>> >> of PQ operations at the cost of (possibly) some extra memory, so I
>> >> thought I would see if anyone has looked at doing something similar here.
>> >> (Xapian is C++, I believe, so it likely has different perf.
>> >> characteristics, which may or may not matter.)
>> >>
>> >> A few things come to mind:
>> >> 1. In many cases, when someone is asking for a lot of results, my guess
>> >> is they want all the results. You often see this in people who are doing
>> >> significant post-processing on large result sets (often for machine
>> >> learning).
>> >> 2. My gut says there is actually some heuristic to be applied here,
>> >> whereby if the requested number of results is some percentage of the
>> >> total number of hits, then do the whole results + sort; otherwise, use the PQ.
>> >> Thus, this might be useful even when asking for, say, 500 or 1000 docs,
>> >> as opposed to the bigger cases I'm thinking about. Still, no need to
>> >> prematurely optimize.
>> >> 3. I think we could almost implement this entirely w/ a Collector,
>> >> except for the sort step, which presumably could be a callback (empty
>> >> implementation in the base class).
>> >> 4. When doing this, you don't bother to truncate the list when n <
>> >> totalHits (although see below).
>> >>
>> >> Perhaps we could also even save having a big array for doc ids and
>> >> instead just flip bits in a bit set (we would still need the array for
>> >> scores) and then materialize the bit set to an array at the end when the
>> >> number requested is less than the total number of hits.
>> >>
>> >> Anyone have thoughts on this? Seems fairly trivial to crank out a
>> >> Collector to do it, minus the post-processing step, which would be
>> >> relatively trivial to add.
>> >>
>> >> -Grant
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>> >
>> > --
>> > Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
>> > Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
>> > ICQ: 104465785
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>> Solr/Lucene: http://www.lucidimagination.com/search
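Mike's hybrid scheme at the top of the thread can be sketched roughly as below. This is a minimal illustration, not Lucene's actual Collector API; the class name, the 2X buffer policy, and the use of a full sort as a stand-in for a true O(n) selection (e.g. quickselect) are all assumptions of the sketch:

```java
import java.util.Arrays;

// Sketch of the hybrid scheme: buffer up to 2*K scores; when the buffer
// fills, keep only the best K and use the K-th best as the new "bottom"
// threshold. Real code would carry doc ids alongside the scores.
class HybridTopK {
    private final int k;
    private final float[] buf;                   // 2X transient RAM tolerated
    private int size = 0;
    private float bottom = Float.NEGATIVE_INFINITY;

    HybridTopK(int k) {
        this.k = k;
        this.buf = new float[2 * k];
    }

    void collect(float score) {
        if (score <= bottom) return;             // not competitive (bottom is -inf until first prune)
        if (size == buf.length) prune();         // hit the 2K mark: prune back to K
        buf[size++] = score;
    }

    // Keep only the top K buffered scores; the K-th best becomes the new bottom.
    // A full sort stands in for an O(n) selection algorithm here.
    private void prune() {
        Arrays.sort(buf, 0, size);               // ascending
        if (size > k) {
            System.arraycopy(buf, size - k, buf, 0, k);
            size = k;
        }
        if (size > 0) bottom = buf[0];
    }

    // Optional final step: the top (at most) K hits, sorted descending.
    float[] topK() {
        prune();
        float[] out = new float[size];
        for (int i = 0; i < size; i++) out[i] = buf[size - 1 - i];
        return out;
    }
}
```

Between prunes, insertions are a plain array append plus one comparison against the bottom, which is where the scheme hopes to beat per-insertion heap maintenance.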
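Grant's Xapian-inspired idea (append every hit to a big array, sort once afterward) could look roughly like this. Again a sketch only: the class and method names are illustrative and this is not Lucene's Collector interface; growth policy and comparator are assumptions:

```java
import java.util.Arrays;

// Sketch of "no PQ": append every hit to growable parallel arrays,
// then do one post-collection sort by descending score.
class CollectAllThenSort {
    private int[] docs = new int[1024];
    private float[] scores = new float[1024];
    private int size = 0;

    void collect(int doc, float score) {
        if (size == docs.length) {               // grow; no per-hit heap work
            docs = Arrays.copyOf(docs, size * 2);
            scores = Arrays.copyOf(scores, size * 2);
        }
        docs[size] = doc;
        scores[size] = score;
        size++;
    }

    // The post-collection sort step: doc ids of the top n hits by score.
    int[] topDocs(int n) {
        Integer[] order = new Integer[size];
        for (int i = 0; i < size; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Float.compare(scores[b], scores[a]));
        int m = Math.min(n, size);
        int[] out = new int[m];
        for (int i = 0; i < m; i++) out[i] = docs[order[i]];
        return out;
    }
}
```

Per Grant's point 2, a real implementation might choose between this and a PQ based on the ratio of requested results to total hits; per his point at the end of the thread, the sort step could be a callback left empty when the caller doesn't need ordering.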
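The bit-set variant in Grant's last paragraph might look like the sketch below. The email leaves the score storage open; a dense maxDoc-sized float array is one assumption made here, and the names are again hypothetical:

```java
import java.util.BitSet;

// Sketch of the bit-set variant: mark each hit's doc id in a BitSet
// (scores still need an array -- here a dense, maxDoc-sized one), then
// materialize the bit set into a doc-id array at the end.
class BitSetHits {
    private final BitSet hits;
    private final float[] scores;

    BitSetHits(int maxDoc) {
        hits = new BitSet(maxDoc);
        scores = new float[maxDoc];
    }

    void collect(int doc, float score) {
        hits.set(doc);
        scores[doc] = score;
    }

    // Materialize the bit set: hit doc ids in increasing order.
    int[] docs() {
        int[] out = new int[hits.cardinality()];
        int i = 0;
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            out[i++] = doc;
        }
        return out;
    }
}
```

This trades the big doc-id array for one bit per document in the index, at the cost of a scan over the bit set at the end; the saving only matters when the hit count is a large fraction of maxDoc.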