[GitHub] [lucene-solr] jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search
jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search URL: https://github.com/apache/lucene-solr/pull/854#issuecomment-530794180 > I was thinking of a class similar to HitsThresholdChecker, allowing shared state, since, as you rightly said, the size of critical section should be small. WDYT? This is a good idea, it could be used to publish the maximum minimum value and collectors would use it to check with their local minimum too. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search
jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search URL: https://github.com/apache/lucene-solr/pull/854#issuecomment-530515005 > RE: synchronization, what are your thoughts on using a global shared array where each collector publishes its bottom value vs message passing? Sure that's a good way to keep track of the maximum minimum score without synchronization. However I think only benchmark can reveal the cost of this so my main concern is the change in the API that is required to share the minimum score globally. If we can come up with an approach that keeps the current top score docs collector simple and contained but isn't optimized in terms of synchronization I'd favor simplicity unless we have clear evidence in benchmarks that we can do better ;). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search
jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search URL: https://github.com/apache/lucene-solr/pull/854#issuecomment-530504547 I think we're talking of different approaches, hence the confusion. It is correct that we can start setting the minimum score when the global count of document that we collected reaches the requested size but if the local pqs are not full you can only use the minimum minimum score. So the bottom score of the minimum scores. Requiring a queue to be filled completely before publishing a minimum score allows to use the maximum minimum score among the slices that have a full pq. We can mix the two approaches, switching from the minimum minimum to the maximum minimum when pqs are filled but I wonder if this is really needed since topN is a small value ? Said differently I wonder if checking the global minimum score before a single pq is filled is a premature optimization ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search
jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search URL: https://github.com/apache/lucene-solr/pull/854#issuecomment-530443453 > Which contract would that be? I believe that today, the contract that we call updateMinCompetitiveScore only when the PQ is full is implied (since there is only one PQ) but I am still not able to understand why that is mandatory. Is there something that I missing? We need to ensure that the minimum score of the slice can be set as the maximum minimum score for all slices, however if the slice is not full this means that some document with smaller scores can still enter the top N no matter what the top hits are in the other slices. So if the total hit threshold is 10 for instance and you ask for 2 top documents per slice, you cannot set the minimum score if only 1 document is inserted in each slice otherwise you can miss documents. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search
jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search URL: https://github.com/apache/lucene-solr/pull/854#issuecomment-530398602 > Thanks for offering. I have a skeletal PR in flight for this approach that I plan to publish tomorrow -- maybe we can iterate on that? Sure, thanks > I am sure I am missing something here. If the user requested top N hits, then all slices can keep collecting hits in their thread local PQs and update a global counter to reflect if total hits collected globally has reached N. Once we have reached N globally, each collector can publish the value of the bottom of their thread local PQ. The minimum of all such values will be our global minimum score, since we know that, collectively, we have N hits available. Post that, all collectors will use the global minimum score to filter hits. If, a collector finds a competitive hit, it adds it to the local queue, updates its local minimum score and triggers a resync, where the minimum of all minimum scores (if that makes sense) is taken and kept as the global worst hit. I think it would be simpler to keep the maximum minimum score on each slice. Each time a slice publish a new minimum score we can broadcast a listener to all the other top docs collector that would update their local minimum score if needed. Synchronization shouldn't be the bottleneck here but happy to be proven wrong. The global counter of total hits must be reached to publish any minimum score but the publisher must also ensure that his local pq is full before publishing since it is possible to reach the total hits threshold while none of the local pq are completely filled so this would break the contract. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search
jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search URL: https://github.com/apache/lucene-solr/pull/854#issuecomment-530373796 I am not sure that synchronization would be an issue in this model since the update of the global min score should be quick, we can also use double checked locking or use a volatile. But it's hard to say without benchmarking as Mike said. It would be easier to discuss with a poc to compare with the current approach in this pr. I can work on one if you want to concentrate on the shared pq path ? > Once globally, numHits are collected, we do a pass through the array and pick the minimum? If numHits means the global threshold then it's just one condition. The slice must also have a full pq to publish a min score and other slices can use it if they need to collect the same number of top docs. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search
jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search URL: https://github.com/apache/lucene-solr/pull/854#issuecomment-530345711 I think that you need to ensure that the slice publishing the new min score is full, otherwise the min score cannot be the global min score. If the other slices are not full they can use the new min score immediately but that's assuming that all slices have the same topN size. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search
jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search URL: https://github.com/apache/lucene-solr/pull/854#issuecomment-530342313 > Is this proposal to continue collecting the full N per slice (thread work unit), until some/all of them have a full (top N) PQ and at that point you sync to find the "best bottom" across all of them? And every time one of the threads finds a newly competitive hit, increasing its bottom, we sync again to find the best bottom? Yes that's another option that we should investigate imo. > This would still require sync-per-insert, and would require collecting top N per thread instead of top N overall? It's hard to judge which is the better tradeoff w/o building both and running perf tests :) If we keep separate pq we'd not need sync-per-insert and could use a volatile to record the shared min score within the top docs collectors. It seems appealing to me because the top N should be small so the overhead of collecting a full topN per slice is small compared to the converging of the min score which can be completely different per slice. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search
jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search URL: https://github.com/apache/lucene-solr/pull/854#issuecomment-528035227 I thought that the follow up for LUCENE-8939 would be to allow the sharing of the minimum score (could be extended to a minimum FieldDoc) across slices ? Sharing the minimum score (or minimum FieldDoc) requires very little synchronization while a global priority queue seems much more costly. The other advantage is that we could add this ability in the current topdocs collector like we did for LUCENE-8939. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org