Could one get the best of both worlds by not heapifying the PQ until it is full?
Regards, Paul Elschot Op zondag 06 december 2009 00:01:49 schreef Grant Ingersoll: > > On Dec 5, 2009, at 10:47 PM, Earwin Burrfoot wrote: > > > If someone needs all results, they know it beforehand. Why can't they > > write this collector themselves? It's trivial, just like you said. > > I'm not following your comment. Of course they can write it. But that's > true for all the implementations we provide. > > However, the Collector stuff doesn't handle the post collection sort, so, it > would require someone to hack some low level Lucene internals. Also, I think > it's interesting to think about the case of getting most of the results, but > maybe not all. > > > > > On Sun, Dec 6, 2009 at 01:22, Grant Ingersoll <gsing...@apache.org> wrote: > >> At ScaleCamp yesterday in the UK, I was listening to a talk on Xapian and > >> the speaker said one of the optimizations they do when retrieving a large > >> result set is that instead of managing a Priority Queue, they just > >> allocate a large array to hold all of the results and then sort afterward. > >> Seemed like a good idea since you could avoid a whole slew of PQ > >> operations at the cost of (possibly) some extra memory, so I thought I > >> would see if anyone has looked at doing something similar here. (Xapian > >> is C++, I believe, so it likely has different perf. characteristics which > >> may or may not matter). > >> > >> A few things come to mind: > >> 1. In many cases, when someone is asking for a lot of results, my guess is > >> they want all the results. You often see this in people who are doing > >> significant post processing on large result sets (often for machine > >> learning) > >> 2. My gut says there is actually some heuristic to be applied here, > >> whereby if the requested number of results is some percentage of the total > >> num. of hits, then do the whole results + sort, otherwise, use PQ. Thus, > >> this maybe would be useful even when doing something like, say, 500 or a > >> 1000 docs, as opposed to the bigger cases I'm thinking about. Still, > >> don't need to prematurely optimize. > >> 3. I think we could almost implement this entirely w/ a Collector, except > >> for the sort step, which presumably could be a callback (empty > >> implementation in the base class). > >> 4. When doing this, you don't bother to truncate the list when n < > >> totalHits (although see below). > >> > >> Perhaps, we could also even save having a big array for doc ids and > >> instead just flip bits in a bit set (would still need the array for > >> scores) and then materialize the bit set to an array at the end when the > >> num requested is less than the total number of hits. > >> > >> Anyone have thoughts on this? Seems fairly trivial to crank out a > >> Collector to do it, minus the post processing step, which would be > >> relatively trivial to add. > >> > >> -Grant > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > >> For additional commands, e-mail: java-dev-h...@lucene.apache.org > >> > >> > > > > > > > > -- > > Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) > > Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 > > ICQ: 104465785 > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-dev-h...@lucene.apache.org > > > > -------------------------- > Grant Ingersoll > http://www.lucidimagination.com/ > > Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using > Solr/Lucene: > http://www.lucidimagination.com/search > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org