[I] Improve FirstPassGroupingCollector to support early termination and pruning/skipping [lucene]

via GitHub Thu, 28 Aug 2025 05:42:08 -0700


alexmm-amzn opened a new issue, #15136:
URL: https://github.com/apache/lucene/issues/15136

### Description

The `FirstPassGroupingCollector`
[[1](https://github.com/apache/lucene/blob/main/lucene/grouping/src/java/org/apache/lucene/search/grouping/FirstPassGroupingCollector.java)]
from the `lucene-grouping` module is used by OpenSearch to implement
`collapse` search queries that deduplicate the search results using a numeric
or keyword docvalues field. This logic is implemented in the
`CollapsingTopDocsCollector`
[[2](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/apache/lucene/search/grouping/CollapsingTopDocsCollector.java)],
a subclass of `FirstPassGroupingCollector`.

Comparing the search performance between `CollapsingTopDocsCollector` and
the regular `TopFieldCollector`
[[3](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java)]
reveals large latency differences, particularly for large hit counts.

From reviewing the source code, it appears that `FirstPassGroupingCollector`
lacks many of the features that `TopFieldCollector` provides to improve search
performance:

- No support for terminating search queries early when using index sorting.
- No support for non-COMPLETE `ScoreMode`s and setting minimum competitive
scores, i.e. no skipping of non-competitive documents even if sorted by
relevance score.
- No support for competitive iterators and hit thresholds, i.e. no pruning
for indexed numeric sort fields.

As a result, the collector visits every hit exhaustively, which causes major
performance issues with large indexes.
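
To illustrate the first point, here is a minimal sketch in plain Java. It does
not use the Lucene API: `Doc`, `Result`, `exhaustive` and `earlyTerminating`
are hypothetical names made up for this example. It assumes a single segment
whose documents are visited in index-sort order, and that this order matches
the (ascending) group sort. Under those assumptions, the first N distinct
groups encountered are already the top N groups, so a first pass could stop as
soon as N groups have been seen.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EarlyTerminationSketch {

    /** Hypothetical stand-in for a hit: a sort value and a group (collapse) key. */
    static final class Doc {
        final long sortValue;
        final String group;

        Doc(long sortValue, String group) {
            this.sortValue = sortValue;
            this.group = group;
        }
    }

    /** Outcome of a first pass: best sort value per selected group, plus docs visited. */
    static final class Result {
        final Map<String, Long> topGroups;
        final int visited;

        Result(Map<String, Long> topGroups, int visited) {
            this.topGroups = topGroups;
            this.visited = visited;
        }
    }

    /** Exhaustive first pass: visits every doc, then keeps the topN groups with the smallest best value. */
    static Result exhaustive(List<Doc> docs, int topN) {
        Map<String, Long> best = new LinkedHashMap<>();
        int visited = 0;
        for (Doc d : docs) {
            visited++;
            best.merge(d.group, d.sortValue, Math::min);
        }
        // Order groups by their best value (stable sort keeps first-seen order on ties).
        List<Map.Entry<String, Long>> entries = new ArrayList<>(best.entrySet());
        entries.sort(Map.Entry.comparingByValue());
        Map<String, Long> top = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : entries.subList(0, Math.min(topN, entries.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return new Result(top, visited);
    }

    /**
     * Early-terminating first pass. ASSUMPTION: docsInSortOrder is already ordered
     * by ascending sort value (as with a matching index sort), so the first doc seen
     * for a group carries that group's best value.
     */
    static Result earlyTerminating(List<Doc> docsInSortOrder, int topN) {
        Map<String, Long> top = new LinkedHashMap<>();
        int visited = 0;
        for (Doc d : docsInSortOrder) {
            if (top.size() >= topN) {
                break; // every remaining doc is non-competitive for the first pass
            }
            visited++;
            top.putIfAbsent(d.group, d.sortValue);
        }
        return new Result(top, visited);
    }
}
```

On sorted input both methods select the same groups, but `earlyTerminating`
stops visiting documents once `topN` groups are filled. A real collector would
additionally have to handle multiple segments, reverse sorts, and hit-count
accuracy, none of which are modeled here.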

For example, we have seen search queries with ~1 million hits that sort on an
indexed numeric field take ~15 ms with `TopFieldCollector` and ~75 ms with
`CollapsingTopDocsCollector`. With a non-indexed numeric sort field (i.e. no
competitive iterator is used), the latencies for both collectors are almost
equal.

Can the `FirstPassGroupingCollector` be improved to support the features
listed above? Or should this be solved with an entirely new collector built
from scratch?

I'm looking for guidance from the Lucene maintainers. Thanks!

Ticket reference:
https://github.com/opensearch-project/OpenSearch/issues/18861

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
