Regarding adding a threshold to TopFieldCollector, do you have ideas
on what it would take to fix the relevant Collector/IndexSearcher APIs
to make this kind of thing easier? (I know this is a doozy, but we
should at least try to think about it and maybe make some progress.)

I can see where things become less efficient in this parallel+sorted
case with large top N, but there are also many other "top k"
algorithms that could be better for different use cases. In your
case, if you throw out the parallel part and just think about doing
your sorted case segment-by-segment, the current code there may be
inefficient too (not as bad, but it still doesn't take full
advantage of sortedness). Maybe we improve that case by scoring some
initial "range" of docs for each/some segments first, and then handle
any "tail". A quick Google search turns up many ideas for how this
logic could work: exact and inexact, sorted and unsorted, distributed
(parallel) and sequential. So I think there are probably other
improvements that could be done here, but I worry about what the
code might look like if we don't refactor it.
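
To make the "initial range, then tail" idea a bit more concrete, here
is a rough sketch in plain Java. It is not Lucene code and does not use
the Collector API: segments are modeled as arrays of sort values that
are already in ascending (index-sort) order, and the class and method
names (HeadThenTailTopK, topK, offer) are made up for illustration. The
head size per segment is pro-rated by segment size, which is just one
possible choice.

```java
import java.util.Collections;
import java.util.PriorityQueue;

// Hypothetical illustration only; not part of Lucene.
final class HeadThenTailTopK {

    /** Returns the k smallest values across all segments, in ascending order.
     *  Each segment must already be sorted ascending. */
    static long[] topK(long[][] segments, int k) {
        if (k <= 0) return new long[0];
        long total = 0;
        for (long[] seg : segments) total += seg.length;

        // Max-heap holding the best (smallest) k values seen so far;
        // its top is the current competitive threshold.
        PriorityQueue<Long> heap = new PriorityQueue<>(Collections.reverseOrder());
        int[] next = new int[segments.length];

        // Phase 1: collect a pro-rated "head" range from every segment.
        for (int s = 0; s < segments.length; s++) {
            long[] seg = segments[s];
            int head = total == 0 ? 0
                : (int) Math.min(seg.length, ((long) k * seg.length + total - 1) / total);
            for (int i = 0; i < head; i++) offer(heap, seg[i], k);
            next[s] = head;
        }

        // Phase 2: walk each "tail", stopping as soon as the segment can no
        // longer compete. Because the segment is sorted, everything after the
        // first non-competitive value is non-competitive too.
        for (int s = 0; s < segments.length; s++) {
            long[] seg = segments[s];
            for (int i = next[s]; i < seg.length; i++) {
                if (heap.size() == k && seg[i] >= heap.peek()) break;
                offer(heap, seg[i], k);
            }
        }

        long[] out = new long[heap.size()];
        for (int i = out.length - 1; i >= 0; i--) out[i] = heap.poll();
        return out;
    }

    /** Keeps only the k smallest values in the heap. */
    private static void offer(PriorityQueue<Long> heap, long v, int k) {
        if (heap.size() < k) {
            heap.add(v);
        } else if (v < heap.peek()) {
            heap.poll();
            heap.add(v);
        }
    }
}
```

In this toy version the result is still exact, because the tail pass
only stops once a segment is provably non-competitive. The head pass
just gets a good threshold established early, so that most tails end
after a single comparison. Sharing that threshold across threads is
where the API question comes in.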

On Sun, Feb 3, 2019 at 3:14 PM Michael McCandless
<luc...@mikemccandless.com> wrote:
>
> On Sun, Feb 3, 2019 at 10:41 AM Michael Sokolov <msoko...@gmail.com> wrote:
>
> > > > In single-threaded mode we can check against minCompetitiveScore and
> > > > terminate collection for each segment appropriately,
> > >
> > > Does Lucene do this today by default?  That should be a nice optimization,
> > > and it'd be safe/correct.
> >
> >
> > Yes, it does that today (in TopFieldCollector -- see
> >
> > https://github.com/msokolov/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java#L225
> > )
> >
>
> Ahh -- great, thanks for finding that.
>
>
> > Re: our high cost of collection in static ranking phase -- that is true,
> > Mike, but I do also see a nice improvement on the luceneutil benchmark
> > (modified to have a sorted index and collect concurrently) using just a
> > vanilla TopFieldCollector. I looked at some profiler output, and it just
> > seems to be showing more time spent walking postings.
> >
>
> Yeah, understood -- I think pro-rating the N collected per segment makes a
> lot of sense.
>
> Mike McCandless
>
> http://blog.mikemccandless.com

