alexmm-amzn opened a new issue, #15136:
URL: https://github.com/apache/lucene/issues/15136

   ### Description
   
   The `FirstPassGroupingCollector` 
[[1](https://github.com/apache/lucene/blob/main/lucene/grouping/src/java/org/apache/lucene/search/grouping/FirstPassGroupingCollector.java)]
 from the `lucene-grouping` module is used by OpenSearch to implement 
`collapse` search queries that deduplicate the search results using a numeric 
or keyword docvalues field. This logic is implemented in the 
`CollapsingTopDocsCollector` 
[[2](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/apache/lucene/search/grouping/CollapsingTopDocsCollector.java)],
 a subclass of `FirstPassGroupingCollector`.
   
   Comparing the search performance between `CollapsingTopDocsCollector` and 
the regular `TopFieldCollector` 
[[3](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java)]
 reveals large latency differences, particularly for queries with large hit counts. 
   
   Reviewing the source code, it looks like the `FirstPassGroupingCollector` 
lacks many of the features that `TopFieldCollector` provides to improve search 
performance:
   
   - No support for terminating search queries early when using index sorting.
   - No support for non-COMPLETE `ScoreMode`s and setting minimum competitive 
scores, i.e. no skipping of non-competitive documents even if sorted by 
relevance score.
   - No support for competitive iterators and hit thresholds, i.e. no pruning 
for indexed numeric sort fields.
   
   As a result, the collector visits all hits exhaustively, which causes major 
performance issues on large indexes.
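
   To illustrate the first point, here is a toy model in plain Java. It is not Lucene code: the class `CollapseEarlyTerminationSketch`, its `collect` method and the `Result` type are hypothetical names made up for this sketch. It assumes hits arrive already ordered by the query sort (as they would when the index sort matches it), so the first hit seen for each group is that group's best hit and the first `topN` distinct groups are the final result.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Toy model (hypothetical, not Lucene API) of first-pass group collection
 * over hits that are ASSUMED to arrive in query sort order.
 */
final class CollapseEarlyTerminationSketch {

    /** Selected group keys plus the number of hits that had to be visited. */
    static final class Result {
        public final List<String> groups;
        public final int visited;

        Result(List<String> groups, int visited) {
            this.groups = groups;
            this.visited = visited;
        }
    }

    /**
     * @param groupKeys      group key of each hit, in sort order (assumption)
     * @param topN           number of distinct groups to keep
     * @param earlyTerminate stop as soon as topN groups are known
     */
    static Result collect(String[] groupKeys, int topN, boolean earlyTerminate) {
        Set<String> seen = new LinkedHashSet<>();
        int visited = 0;
        if (topN <= 0) {
            return new Result(new ArrayList<>(), visited);
        }
        for (String key : groupKeys) {
            visited++;
            // Hits are sorted, so a later hit can never displace an earlier
            // group: only add new groups while there is still room.
            if (seen.size() < topN) {
                seen.add(key);
            }
            // Once topN groups are known, no remaining hit is competitive.
            if (earlyTerminate && seen.size() == topN) {
                break;
            }
        }
        return new Result(new ArrayList<>(seen), visited);
    }
}
```

   For the keys `a, a, b, c, b, d, e` and `topN = 2`, both modes select the groups `a` and `b`, but the early-terminating pass visits 3 hits while the exhaustive pass visits all 7. The exhaustive pass corresponds to the behaviour described above, where every hit is visited.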
   
   For example, we have seen search queries with ~1M hits on an indexed 
numeric sort field take ~15 ms with `TopFieldCollector` and ~75 ms with 
`CollapsingTopDocsCollector`. With a non-indexed numeric sort field (i.e. when 
no competitive iterator is used), the latencies of both collectors are almost 
equal.
   
   Can the `FirstPassGroupingCollector` be improved to support the features 
listed above? Or should this be solved with an entirely new collector built 
from scratch?
   
   I'm looking for guidance from the Lucene maintainers. Thanks!
   
   Ticket reference: 
https://github.com/opensearch-project/OpenSearch/issues/18861
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

