OK. Thanks for uploading the PR. I think you should definitely open an issue?

Its still worth looking at the API for this specific proposal, as this
may not return the exact top-N, correct? I think we need to be careful
and keep the user involved when it comes to inexact, not everyone's
data might meet the expected distribution and maybe others would
rather have the exact results. And of course this part of the api (the
way the user specifies such things) is way more important than how
ugly the collector code is behind the scenes. When I was looking at
this topic years ago, i looked at basically an enum being passed to
IndexSearcher with 3 possible values: exact top N with full counts
(slowest), exact top N without full counts (faster), inexact top N
(fastest). When Adrien did the actual work, the enum seemed overkill
because only 2 of the possibilties were implemented, but maybe its
worth revisiting?

On Tue, Feb 5, 2019 at 7:50 AM Michael Sokolov <msoko...@gmail.com> wrote:
>
> Hi Robert - yeah this is a complex subject. I think there's room for some
> exciting improvement though. There is some discussion in LUCENE-8675 that
> is pointing out some potential API-level problems we may have to address in
> order to make the most efficient use of the segment structure in a sorted
> index. I think generally speaking we're trying to think of ways to search
> partial segments. I don't have concrete proposals for API changes at this
> point, but it's clear there are some hurdles to be grappled with. For
> example, Adrien's point about BKD trees having a high up-front cost points
> out one difficulty. If we want to search a single segment in multiple "work
> units" (whether threaded or not), we want a way to share that up-front cost
> without needing to pay it over again for each work unit. I think a similar
> problem also occurs with some other query types (MultiTerm can produce a
> bitset I believe?).
>
> As far as the specific (prorated early termination) proposal here .. this
> is something very specific and localized within TopFieldCollector that
> doesn't require any public-facing API change or refactoring at all. It just
> terminates a little earlier based on the segment distribution. Here's a PR
> so you can see what this is: https://github.com/apache/lucene-solr/pull/564
>
>
> On Mon, Feb 4, 2019 at 8:44 AM Robert Muir <rcm...@gmail.com> wrote:
>
> > Regarding adding a threshold to TopFieldCollector, do you have ideas
> > on what it would take to fix the relevant collector/indexsearcher APIs
> > to make this kind of thing easier? (i know this is a doozie, but we
> > should at least try to think about it, maybe make some progress)
> >
> > I can see where things become less efficient in this parallel+sorted
> > case with large top N, but there are also many other "top k
> > algorithms" that could be better for different use cases. in your
> > case, if you throw out the parallel and just think about doing your
> > sorted case segment-by-segment, the current code there may be
> > inefficient too (not as bad, but still doesn't really take total
> > advantage of sortedness). Maybe we improve that case by scoring some
> > initial "range" of docs for each/some segments first, and then handle
> > any "tail". With a simple google search I easily find many ideas for
> > how this logic could work: exact and inexact, sorted and unsorted,
> > distributed (parallel) and sequential.  So I think there are probably
> > other improvements that could be done here, but worry about what the
> > code might look like if we don't refactor it.
> >
> > On Sun, Feb 3, 2019 at 3:14 PM Michael McCandless
> > <luc...@mikemccandless.com> wrote:
> > >
> > > On Sun, Feb 3, 2019 at 10:41 AM Michael Sokolov <msoko...@gmail.com>
> > wrote:
> > >
> > >  > > In single-threaded mode we can check against minCompetitiveScore and
> > > > terminate collection for each segment appropriately,
> > > >
> > > > > Does Lucene do this today by default?  That should be a nice
> > > > optimization,
> > > > and it'd be safe/correct.
> > > >
> > > > Yes, it does that today (in TopFieldCollector -- see
> > > >
> > > >
> > https://github.com/msokolov/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java#L225
> > > > )
> > > >
> > >
> > > Ahh -- great, thanks for finding that.
> > >
> > >
> > > > Re: our high cost of collection in static ranking phase -- that is
> > true,
> > > > Mike, but I do also see a nice improvement on the luceneutil benchmark
> > > > (modified to have a sorted index and collect concurrently) using just a
> > > > vanilla TopFieldCollector. I looked at some profiler output, and it
> > just
> > > > seems to be showing more time spent walking postings.
> > > >
> > >
> > > Yeah, understood -- I think pro-rating the N collected per segment makes
> > a
> > > lot of sense.
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to