Regarding adding a threshold to TopFieldCollector: do you have ideas on what it would take to fix the relevant Collector/IndexSearcher APIs to make this kind of thing easier? (I know this is a doozy, but we should at least try to think about it and maybe make some progress.)
I can see where things become less efficient in this parallel+sorted case with a large top N, but there are also many other "top k algorithms" that could be better for different use cases. In your case, if you throw out the parallelism and just think about doing your sorted case segment by segment, the current code may be inefficient there too (not as bad, but it still doesn't take full advantage of sortedness). Maybe we improve that case by scoring some initial "range" of docs for each (or some) segments first, and then handling any "tail". A simple Google search easily turns up many ideas for how this logic could work: exact and inexact, sorted and unsorted, distributed (parallel) and sequential. So I think there are probably other improvements that could be made here, but I worry about what the code might look like if we don't refactor it.

On Sun, Feb 3, 2019 at 3:14 PM Michael McCandless <luc...@mikemccandless.com> wrote:
>
> On Sun, Feb 3, 2019 at 10:41 AM Michael Sokolov <msoko...@gmail.com> wrote:
>
> > > > In single-threaded mode we can check against minCompetitiveScore and
> > > > terminate collection for each segment appropriately,
> > >
> > > Does Lucene do this today by default? That should be a nice optimization,
> > > and it'd be safe/correct.
> >
> > Yes, it does that today (in TopFieldCollector -- see
> > https://github.com/msokolov/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java#L225
> > )
>
> Ahh -- great, thanks for finding that.
>
> > Re: our high cost of collection in static ranking phase -- that is true,
> > Mike, but I do also see a nice improvement on the luceneutil benchmark
> > (modified to have a sorted index and collect concurrently) using just a
> > vanilla TopFieldCollector. I looked at some profiler output, and it just
> > seems to be showing more time spent walking postings.
>
> Yeah, understood -- I think pro-rating the N collected per segment makes a
> lot of sense.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
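
To make the "initial range, then tail" idea a bit more concrete, here is a rough, sequential sketch. It is not Lucene code and does not use the Collector API: plain sorted long[] arrays stand in for index-sorted segments, smaller values sort first, and the class and method names (SortedTopK, topK, Result) are made up for illustration. Phase 1 collects a pro-rated share of k from the front of each segment; phase 2 walks each tail and stops as soon as a value can no longer beat the current k-th best, since everything after it in a sorted segment is no better.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; arrays stand in for index-sorted segments.
final class SortedTopK {

    static final class Result {
        final List<Long> topK;  // the k smallest values, ascending
        final int visited;      // how many entries were examined
        Result(List<Long> topK, int visited) {
            this.topK = topK;
            this.visited = visited;
        }
    }

    static Result topK(long[][] segments, int k) {
        if (k <= 0 || segments.length == 0) {
            return new Result(new ArrayList<>(), 0);
        }
        // Max-heap: the head is the worst of the current best k.
        PriorityQueue<Long> heap = new PriorityQueue<>(Collections.reverseOrder());
        int[] pos = new int[segments.length];
        int visited = 0;

        // Phase 1: a pro-rated initial range from each segment (ceil(k / numSegments)).
        int perSegment = (k + segments.length - 1) / segments.length;
        for (int s = 0; s < segments.length; s++) {
            int limit = Math.min(perSegment, segments[s].length);
            for (; pos[s] < limit; pos[s]++) {
                visited++;
                offer(heap, segments[s][pos[s]], k);
            }
        }

        // Phase 2: the tail of each segment, with early termination.
        for (int s = 0; s < segments.length; s++) {
            for (; pos[s] < segments[s].length; pos[s]++) {
                long v = segments[s][pos[s]];
                visited++;
                if (heap.size() == k && v >= heap.peek()) {
                    break; // segment is sorted, so nothing later can compete
                }
                offer(heap, v, k);
            }
        }

        List<Long> out = new ArrayList<>(heap);
        Collections.sort(out);
        return new Result(out, visited);
    }

    // Keep at most k values, replacing the current worst when v beats it.
    private static void offer(PriorityQueue<Long> heap, long v, int k) {
        if (heap.size() < k) {
            heap.add(v);
        } else if (v < heap.peek()) {
            heap.poll();
            heap.add(v);
        }
    }
}
```

For example, calling SortedTopK.topK on the three segments {1, 4, 9, 12, 20}, {2, 3, 30, 40} and {5, 6, 7, 8} with k = 4 examines only part of each segment before stopping. A parallel version would need to share the threshold across threads, which is the part the current APIs make awkward.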