mikemccand commented on issue #823: LUCENE-8939: Introduce Shared Count Early 
Termination In Parallel Search
URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-527262380
 
 
   > > > I'm trying to understand the behavior change Lucene users will see 
with this, when using concurrent searching for one query (passing 
`ExecutorService` to `IndexSearcher`):
   > > > It looks like with the change such users will see their search terminate 
precisely when the total collected hits exceed the limit (1000 by default?), 
versus today where we will try to collect 1000 per segment and then reduce that 
to the top 1000 overall? So this means the results will change depending on 
thread execution/timing?
   > > 
   > > 
   > > Looking at the documentation around `TOTAL_HITS_THRESHOLD`, I see that 
it intends to restrict the number of documents scored in total before the query 
is early terminated. If we do a single threaded search today, that is the 
behavior we get. However, for concurrent search, we actually look at N * 
`TOTAL_HITS_THRESHOLD`, where N is the number of slices. So, I believe that we 
are not doing the advertised behavior for concurrent searches in the status 
quo. This change should fix that.
   > > However, you are correct that thread timing will come into play here -- 
different slices may have different contributions to the overall number of 
hits. However, since we are not scoring all documents anyway, I do not believe 
we offer any guarantees on the documents that we return -- even today, the best 
documents might be the ones which just came in and hence are on the last 
segments to be traversed, so they never even get looked at. WDYT?
   > 
   > OK that makes sense @atris -- it seems that which specific top hits you'll 
get back is intentionally not defined in the API and so we have the freedom to 
make improvements like this.
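
To make the quoted discussion concrete, here is a rough sketch of the two counting schemes. It is not code from the PR or from Lucene: `SharedCountSketch` and its methods are hypothetical names, slices are modelled as plain document counts, and every document is assumed to be a hit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not Lucene's collector code.
class SharedCountSketch {

    // Status quo as described in the thread: each slice counts hits on its own,
    // so up to (number of slices * threshold) hits are collected overall.
    static int collectedWithPerSliceCounts(int[] sliceSizes, int threshold) {
        int total = 0;
        for (int size : sliceSizes) {
            total += Math.min(size, threshold);
        }
        return total;
    }

    // Shared-count idea: all slices bump one counter and stop once the global
    // budget is used up. Returns the hits collected per slice; how the budget
    // is split between slices depends on thread timing.
    static int[] collectedWithSharedCount(int[] sliceSizes, int threshold) {
        AtomicInteger sharedCount = new AtomicInteger();
        ExecutorService executor =
            Executors.newFixedThreadPool(Math.max(1, sliceSizes.length));
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int size : sliceSizes) {
                Callable<Integer> task = () -> {
                    int collected = 0;
                    for (int doc = 0; doc < size; doc++) {
                        // Claim one slot of the global budget, or stop this slice.
                        if (sharedCount.getAndIncrement() >= threshold) {
                            break;
                        }
                        collected++;
                    }
                    return collected;
                };
                futures.add(executor.submit(task));
            }
            int[] perSlice = new int[sliceSizes.length];
            for (int i = 0; i < perSlice.length; i++) {
                perSlice[i] = futures.get(i).get();
            }
            return perSlice;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] sizes = {5000, 5000, 5000, 5000};
        System.out.println("per-slice counting: "
            + collectedWithPerSliceCounts(sizes, 1000));
        System.out.println("shared counting, per slice: "
            + java.util.Arrays.toString(collectedWithSharedCount(sizes, 1000)));
    }
}
```

In this model the shared counter always collects exactly the threshold in total (when enough hits exist), but the per-slice split can differ from run to run, which is the timing dependence raised above.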
   
   I'm still confused about this change -- wouldn't it be better to e.g. 
pro-rate the topN per segment for the concurrent case 
(https://issues.apache.org/jira/browse/LUCENE-8681) rather than rely on the 
JVM's thread scheduling to determine which `TOTAL_HITS_THRESHOLD` hits are 
collected?
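
For illustration, pro-rating could look roughly like the sketch below. This is a guess at the simplest form, a purely proportional split of `topN` by slice size. The approach in LUCENE-8681 may use a different formula (for example with extra slack per slice), and `ProRatedTopNSketch` and `proRate` are hypothetical names, not Lucene APIs.

```java
// Hypothetical sketch, not Lucene code: split a global topN across slices in
// proportion to each slice's document count.
class ProRatedTopNSketch {

    static int[] proRate(int topN, int[] sliceDocCounts) {
        long totalDocs = 0;
        for (int count : sliceDocCounts) {
            totalDocs += count;
        }
        int[] perSlice = new int[sliceDocCounts.length];
        if (totalDocs == 0) {
            return perSlice; // nothing to collect anywhere
        }
        for (int i = 0; i < sliceDocCounts.length; i++) {
            // Ceiling of topN * count / totalDocs, so a non-empty slice gets
            // at least one slot when topN > 0.
            long share = ((long) topN * sliceDocCounts[i] + totalDocs - 1) / totalDocs;
            // A slice cannot contribute more hits than it has documents.
            perSlice[i] = (int) Math.min(share, sliceDocCounts[i]);
        }
        return perSlice;
    }

    public static void main(String[] args) {
        int[] budgets = proRate(1000, new int[] {600_000, 300_000, 100_000});
        System.out.println(java.util.Arrays.toString(budgets));
    }
}
```

The per-slice budgets here depend only on slice sizes, so the same index and query yield the same budgets regardless of thread scheduling. The trade-off is that a strictly proportional split can miss top hits that cluster in one slice.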

