alexmm-amzn opened a new issue, #15136: URL: https://github.com/apache/lucene/issues/15136
### Description

The `FirstPassGroupingCollector` [[1](https://github.com/apache/lucene/blob/main/lucene/grouping/src/java/org/apache/lucene/search/grouping/FirstPassGroupingCollector.java)] from the `lucene-grouping` module is used by OpenSearch to implement `collapse` search queries, which deduplicate the search results using a numeric or keyword docvalues field. This logic is implemented in the `CollapsingTopDocsCollector` [[2](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/apache/lucene/search/grouping/CollapsingTopDocsCollector.java)], a subclass of `FirstPassGroupingCollector`.

Comparing the search performance of `CollapsingTopDocsCollector` with that of the regular `TopFieldCollector` [[3](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java)] reveals massive differences, in particular for large hit counts. Reviewing the source code, it looks like `FirstPassGroupingCollector` lacks many of the features that `TopFieldCollector` provides to improve search performance:

- No support for terminating search queries early when using index sorting.
- No support for non-COMPLETE `ScoreMode`s and setting minimum competitive scores, i.e. no skipping of non-competitive documents even if sorted by relevance score.
- No support for competitive iterators and hit thresholds, i.e. no pruning for indexed numeric sort fields.

As a result, the collector visits all hits exhaustively, which causes major performance issues with large indexes. For example, we have seen search queries with ~1 million hits that sort on an indexed numeric field take ~15ms with `TopFieldCollector` and ~75ms with `CollapsingTopDocsCollector`. With a non-indexed numeric sort field (i.e. no competitive iterator is used), the latencies for both collectors are almost equal.

Can the `FirstPassGroupingCollector` be improved to support the features listed above?
Or should this be solved with an entirely new collector built from scratch? I'm looking for guidance from the Lucene maintainers. Thanks!

Ticket reference: https://github.com/opensearch-project/OpenSearch/issues/18861
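
To make the pruning question above concrete, here is a minimal, self-contained sketch of a first-pass "top N groups" collection in plain Java. It is a toy model, not Lucene code: `CollapseSketch`, `Hit`, `Result`, `collect` and `sampleHits` are hypothetical names invented for this illustration, and it assumes that a smaller sort value is better and that ties go to the earlier hit. It shows that once N groups have been collected, any hit whose sort value is no better than the worst kept group cannot change the result, and that with hits arriving in sort order (as with index sorting) the loop could stop at the first such hit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CollapseSketch {

    /** One hit: the collapse (group) key and a sort value, where smaller is better. */
    static final class Hit {
        final String group;
        final long sortValue;

        Hit(String group, long sortValue) {
            this.group = group;
            this.sortValue = sortValue;
        }
    }

    /** Best sort value per kept group, plus the number of hits the loop looked at. */
    static final class Result {
        final Map<String, Long> topGroups;
        final int hitsExamined;

        Result(Map<String, Long> topGroups, int hitsExamined) {
            this.topGroups = topGroups;
            this.hitsExamined = hitsExamined;
        }
    }

    /**
     * Keeps the topN groups with the best (smallest) sort value.
     * If sortedBySortValue is true, the caller promises that hits arrive in
     * ascending sort order, so the loop stops at the first non-competitive hit.
     */
    static Result collect(List<Hit> hits, int topN, boolean sortedBySortValue) {
        Map<String, Long> best = new HashMap<>();
        int examined = 0;
        for (Hit hit : hits) {
            examined++;
            String worstGroup = null;
            if (best.size() == topN) {
                // Linear scan for the worst kept group; a real collector would
                // maintain an ordered structure instead.
                long worst = Long.MIN_VALUE;
                for (Map.Entry<String, Long> e : best.entrySet()) {
                    if (worstGroup == null || e.getValue() > worst) {
                        worst = e.getValue();
                        worstGroup = e.getKey();
                    }
                }
                // Non-competitive: cannot improve an existing group (every kept
                // group is at least as good as 'worst') and cannot enter as a new one.
                if (hit.sortValue >= worst) {
                    if (sortedBySortValue) {
                        break; // all later hits are at least as bad
                    }
                    continue;
                }
            }
            Long current = best.get(hit.group);
            if (current != null) {
                if (hit.sortValue < current) {
                    best.put(hit.group, hit.sortValue);
                }
                continue;
            }
            if (worstGroup != null) {
                best.remove(worstGroup); // evict the worst group to make room
            }
            best.put(hit.group, hit.sortValue);
        }
        return new Result(best, examined);
    }

    /** Sample hits in ascending sort order, as an index-sorted segment would deliver them. */
    static List<Hit> sampleHits() {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("a", 1));
        hits.add(new Hit("a", 2));
        hits.add(new Hit("b", 3));
        hits.add(new Hit("c", 4));
        hits.add(new Hit("b", 5));
        hits.add(new Hit("d", 6));
        hits.add(new Hit("e", 7));
        hits.add(new Hit("a", 8));
        return hits;
    }

    public static void main(String[] args) {
        Result exhaustive = collect(sampleHits(), 2, false);
        Result early = collect(sampleHits(), 2, true);
        System.out.println("exhaustive examined " + exhaustive.hitsExamined + " hits: " + exhaustive.topGroups);
        System.out.println("early-terminating examined " + early.hitsExamined + " hits: " + early.topGroups);
    }
}
```

On the sample data both passes keep the same two groups, but the exhaustive pass examines all 8 hits while the early-terminating pass stops after 4. Whether the same threshold can be propagated to Lucene's scorers and competitive iterators from within a grouping collector is exactly the open question for the maintainers; this sketch only suggests that the collapsing logic itself does not rule it out.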