OK - I opened https://issues.apache.org/jira/browse/LUCENE-8681 and talked
about the possible user knobs I think we could provide.

On Tue, Feb 5, 2019 at 9:10 AM Robert Muir <rcm...@gmail.com> wrote:

> OK. Thanks for uploading the PR. I think you should definitely open an
> issue?
>
> Its still worth looking at the API for this specific proposal, as this
> may not return the exact top-N, correct? I think we need to be careful
> and keep the user involved when it comes to inexact, not everyone's
> data might meet the expected distribution and maybe others would
> rather have the exact results. And of course this part of the api (the
> way the user specifies such things) is way more important than how
> ugly the collector code is behind the scenes. When I was looking at
> this topic years ago, i looked at basically an enum being passed to
> IndexSearcher with 3 possible values: exact top N with full counts
> (slowest), exact top N without full counts (faster), inexact top N
> (fastest). When Adrien did the actual work, the enum seemed overkill
> because only 2 of the possibilties were implemented, but maybe its
> worth revisiting?
>
> On Tue, Feb 5, 2019 at 7:50 AM Michael Sokolov <msoko...@gmail.com> wrote:
> >
> > Hi Robert - yeah this is a complex subject. I think there's room for some
> > exciting improvement though. There is some discussion in LUCENE-8675 that
> > is pointing out some potential API-level problems we may have to address
> in
> > order to make the most efficient use of the segment structure in a sorted
> > index. I think generally speaking we're trying to think of ways to search
> > partial segments. I don't have concrete proposals for API changes at this
> > point, but it's clear there are some hurdles to be grappled with. For
> > example, Adrien's point about BKD trees having a high up-front cost
> points
> > out one difficulty. If we want to search a single segment in multiple
> "work
> > units" (whether threaded or not), we want a way to share that up-front
> cost
> > without needing to pay it over again for each work unit. I think a
> similar
> > problem also occurs with some other query types (MultiTerm can produce a
> > bitset I believe?).
> >
> > As far as the specific (prorated early termination) proposal here .. this
> > is something very specific and localized within TopFieldCollector that
> > doesn't require any public-facing API change or refactoring at all. It
> just
> > terminates a little earlier based on the segment distribution. Here's a
> PR
> > so you can see what this is:
> https://github.com/apache/lucene-solr/pull/564
> >
> >
> > On Mon, Feb 4, 2019 at 8:44 AM Robert Muir <rcm...@gmail.com> wrote:
> >
> > > Regarding adding a threshold to TopFieldCollector, do you have ideas
> > > on what it would take to fix the relevant collector/indexsearcher APIs
> > > to make this kind of thing easier? (i know this is a doozie, but we
> > > should at least try to think about it, maybe make some progress)
> > >
> > > I can see where things become less efficient in this parallel+sorted
> > > case with large top N, but there are also many other "top k
> > > algorithms" that could be better for different use cases. in your
> > > case, if you throw out the parallel and just think about doing your
> > > sorted case segment-by-segment, the current code there may be
> > > inefficient too (not as bad, but still doesn't really take total
> > > advantage of sortedness). Maybe we improve that case by scoring some
> > > initial "range" of docs for each/some segments first, and then handle
> > > any "tail". With a simple google search I easily find many ideas for
> > > how this logic could work: exact and inexact, sorted and unsorted,
> > > distributed (parallel) and sequential.  So I think there are probably
> > > other improvements that could be done here, but worry about what the
> > > code might look like if we don't refactor it.
> > >
> > > On Sun, Feb 3, 2019 at 3:14 PM Michael McCandless
> > > <luc...@mikemccandless.com> wrote:
> > > >
> > > > On Sun, Feb 3, 2019 at 10:41 AM Michael Sokolov <msoko...@gmail.com>
> > > wrote:
> > > >
> > > >  > > In single-threaded mode we can check against
> minCompetitiveScore and
> > > > > terminate collection for each segment appropriately,
> > > > >
> > > > > > Does Lucene do this today by default?  That should be a nice
> > > > > optimization,
> > > > > and it'd be safe/correct.
> > > > >
> > > > > Yes, it does that today (in TopFieldCollector -- see
> > > > >
> > > > >
> > >
> https://github.com/msokolov/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java#L225
> > > > > )
> > > > >
> > > >
> > > > Ahh -- great, thanks for finding that.
> > > >
> > > >
> > > > > Re: our high cost of collection in static ranking phase -- that is
> > > true,
> > > > > Mike, but I do also see a nice improvement on the luceneutil
> benchmark
> > > > > (modified to have a sorted index and collect concurrently) using
> just a
> > > > > vanilla TopFieldCollector. I looked at some profiler output, and it
> > > just
> > > > > seems to be showing more time spent walking postings.
> > > > >
> > > >
> > > > Yeah, understood -- I think pro-rating the N collected per segment
> makes
> > > a
> > > > lot of sense.
> > > >
> > > > Mike McCandless
> > > >
> > > > http://blog.mikemccandless.com
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

Reply via email to