OK - I opened https://issues.apache.org/jira/browse/LUCENE-8681 and talked about the possible user knobs I think we could provide.
On Tue, Feb 5, 2019 at 9:10 AM Robert Muir <rcm...@gmail.com> wrote: > OK. Thanks for uploading the PR. I think you should definitely open an > issue? > > Its still worth looking at the API for this specific proposal, as this > may not return the exact top-N, correct? I think we need to be careful > and keep the user involved when it comes to inexact, not everyone's > data might meet the expected distribution and maybe others would > rather have the exact results. And of course this part of the api (the > way the user specifies such things) is way more important than how > ugly the collector code is behind the scenes. When I was looking at > this topic years ago, i looked at basically an enum being passed to > IndexSearcher with 3 possible values: exact top N with full counts > (slowest), exact top N without full counts (faster), inexact top N > (fastest). When Adrien did the actual work, the enum seemed overkill > because only 2 of the possibilties were implemented, but maybe its > worth revisiting? > > On Tue, Feb 5, 2019 at 7:50 AM Michael Sokolov <msoko...@gmail.com> wrote: > > > > Hi Robert - yeah this is a complex subject. I think there's room for some > > exciting improvement though. There is some discussion in LUCENE-8675 that > > is pointing out some potential API-level problems we may have to address > in > > order to make the most efficient use of the segment structure in a sorted > > index. I think generally speaking we're trying to think of ways to search > > partial segments. I don't have concrete proposals for API changes at this > > point, but it's clear there are some hurdles to be grappled with. For > > example, Adrien's point about BKD trees having a high up-front cost > points > > out one difficulty. If we want to search a single segment in multiple > "work > > units" (whether threaded or not), we want a way to share that up-front > cost > > without needing to pay it over again for each work unit. I think a > similar > > problem also occurs with some other query types (MultiTerm can produce a > > bitset I believe?). > > > > As far as the specific (prorated early termination) proposal here .. this > > is something very specific and localized within TopFieldCollector that > > doesn't require any public-facing API change or refactoring at all. It > just > > terminates a little earlier based on the segment distribution. Here's a > PR > > so you can see what this is: > https://github.com/apache/lucene-solr/pull/564 > > > > > > On Mon, Feb 4, 2019 at 8:44 AM Robert Muir <rcm...@gmail.com> wrote: > > > > > Regarding adding a threshold to TopFieldCollector, do you have ideas > > > on what it would take to fix the relevant collector/indexsearcher APIs > > > to make this kind of thing easier? (i know this is a doozie, but we > > > should at least try to think about it, maybe make some progress) > > > > > > I can see where things become less efficient in this parallel+sorted > > > case with large top N, but there are also many other "top k > > > algorithms" that could be better for different use cases. in your > > > case, if you throw out the parallel and just think about doing your > > > sorted case segment-by-segment, the current code there may be > > > inefficient too (not as bad, but still doesn't really take total > > > advantage of sortedness). Maybe we improve that case by scoring some > > > initial "range" of docs for each/some segments first, and then handle > > > any "tail". With a simple google search I easily find many ideas for > > > how this logic could work: exact and inexact, sorted and unsorted, > > > distributed (parallel) and sequential. So I think there are probably > > > other improvements that could be done here, but worry about what the > > > code might look like if we don't refactor it. > > > > > > On Sun, Feb 3, 2019 at 3:14 PM Michael McCandless > > > <luc...@mikemccandless.com> wrote: > > > > > > > > On Sun, Feb 3, 2019 at 10:41 AM Michael Sokolov <msoko...@gmail.com> > > > wrote: > > > > > > > > > > In single-threaded mode we can check against > minCompetitiveScore and > > > > > terminate collection for each segment appropriately, > > > > > > > > > > > Does Lucene do this today by default? That should be a nice > > > > > optimization, > > > > > and it'd be safe/correct. > > > > > > > > > > Yes, it does that today (in TopFieldCollector -- see > > > > > > > > > > > > > > https://github.com/msokolov/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java#L225 > > > > > ) > > > > > > > > > > > > > Ahh -- great, thanks for finding that. > > > > > > > > > > > > > Re: our high cost of collection in static ranking phase -- that is > > > true, > > > > > Mike, but I do also see a nice improvement on the luceneutil > benchmark > > > > > (modified to have a sorted index and collect concurrently) using > just a > > > > > vanilla TopFieldCollector. I looked at some profiler output, and it > > > just > > > > > seems to be showing more time spent walking postings. > > > > > > > > > > > > > Yeah, understood -- I think pro-rating the N collected per segment > makes > > > a > > > > lot of sense. > > > > > > > > Mike McCandless > > > > > > > > http://blog.mikemccandless.com > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >