Question about grouping in distribute mode

Diego Ceccarelli Thu, 30 Mar 2017 06:26:51 -0700

Hello, I'm currently working on Solr grouping in order to support reranking
[1].
I have a working patch for non-distributed search, and I'm now working on
the distributed setting.


Looking at the code, a distributed grouping search (top-k groups, top-n
documents for each group) consists of the following phases:

GROUPING_DISTRIBUTED_FIRST
1. given the grouping query, each shard returns its top-k groups
2. the federator merges the per-shard top-k groups and produces the overall
top-k groups for the query

GROUPING_DISTRIBUTED_SECOND
1. given the top-k groups, each shard returns its top-n documents for
each group
2. the federator then computes the top-n documents for each group by
merging the responses from all the shards

GET_FIELDS
as usual
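
To make the two grouping phases concrete, here is a minimal Python sketch.
It is not Solr code: the toy model (each shard as a list of
(doc_id, group, score) tuples) and all function names are assumptions made
for illustration only.

```python
from collections import defaultdict
import heapq

# Hypothetical toy model: each shard is a list of (doc_id, group, score).

def shard_top_groups(shard, k):
    """First phase, per shard: top-k groups ranked by best document score."""
    best = {}
    for _doc_id, group, score in shard:
        best[group] = max(score, best.get(group, float("-inf")))
    return heapq.nlargest(k, best.items(), key=lambda item: item[1])

def merge_top_groups(shard_responses, k):
    """First phase, federator: merge per-shard groups into the global top-k."""
    best = {}
    for response in shard_responses:
        for group, score in response:
            best[group] = max(score, best.get(group, float("-inf")))
    top = heapq.nlargest(k, best.items(), key=lambda item: item[1])
    return [group for group, _score in top]

def shard_top_docs(shard, groups, n):
    """Second phase, per shard: top-n documents for each requested group."""
    wanted = set(groups)
    by_group = defaultdict(list)
    for doc_id, group, score in shard:
        if group in wanted:
            by_group[group].append((score, doc_id))
    return {g: heapq.nlargest(n, docs) for g, docs in by_group.items()}

def merge_top_docs(shard_responses, groups, n):
    """Second phase, federator: merge documents, keeping the group order."""
    merged = {g: [] for g in groups}
    for response in shard_responses:
        for g, docs in response.items():
            merged[g].extend(docs)
    return [(g, heapq.nlargest(n, merged[g])) for g in groups]

shards = [
    [("d1", "a", 0.9), ("d2", "b", 0.5), ("d3", "a", 0.4)],
    [("d4", "b", 0.8), ("d5", "c", 0.7), ("d6", "a", 0.6)],
]
top_groups = merge_top_groups([shard_top_groups(s, 2) for s in shards], 2)
result = merge_top_docs(
    [shard_top_docs(s, top_groups, 2) for s in shards], top_groups, 2
)
```

Note that the second phase in this sketch only fills in documents for the
groups chosen in the first phase; it never reorders the groups themselves.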

My plan was to change the collector in GROUPING_DISTRIBUTED_SECOND so that
it returns the top documents for each group with a new score given by the
reranking function (affecting maxScore for each group, and therefore also
the order of the groups).
Looking at the code, I then realized that TopGroups asserts that the order
of the groups does not change, and that indeed _if the ranking function is
the same, the group order can't change after the first stage_.
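
A small Python illustration of why reranking in the second stage conflicts
with that assertion. The documents, scores, and the group_order helper are
made up for the example; groups are ordered by the best score among their
documents.

```python
# Hypothetical documents: (doc_id, group, original_score, reranked_score).
docs = [
    ("d1", "a", 0.9, 0.2),
    ("d2", "a", 0.4, 0.3),
    ("d3", "b", 0.8, 0.95),
]

def group_order(docs, score_index):
    """Order groups by the best score among their documents."""
    best = {}
    for doc in docs:
        group, score = doc[1], doc[score_index]
        best[group] = max(score, best.get(group, float("-inf")))
    return sorted(best, key=best.get, reverse=True)

# Order fixed by the first stage, which ranks by the original score.
first_stage_order = group_order(docs, 2)
# Order implied by the reranked scores computed in the second stage.
reranked_order = group_order(docs, 3)
```

Here the first stage ranks group "a" ahead of "b", while the reranked
scores would put "b" ahead of "a", which is exactly the kind of change the
merge step does not expect.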

My question is: if the user is interested only in the top document for each
group (i.e., the default, group.limit = 1), do we really need
GROUPING_DISTRIBUTED_SECOND, or could we skip it?
Is there any reason to perform the second grouping phase in this case? Or
could we just return the top docid together with the top groups in
GROUPING_DISTRIBUTED_FIRST and then go directly to GET_FIELDS?
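
As a rough Python sketch of this idea (again not Solr code; the toy model
of a shard as a list of (doc_id, group, score) tuples and both function
names are assumptions), each shard would return its top-k groups together
with the best docid of each group, and the federator would keep the best
entry per group while merging:

```python
import heapq

# Hypothetical toy model: each shard is a list of (doc_id, group, score).

def shard_top_groups_with_doc(shard, k):
    """Per shard: top-k groups, each carrying its best (score, doc_id)."""
    best = {}
    for doc_id, group, score in shard:
        if group not in best or score > best[group][0]:
            best[group] = (score, doc_id)
    return heapq.nlargest(k, best.items(), key=lambda item: item[1][0])

def merge_with_doc(shard_responses, k):
    """Federator: merge groups and keep the best document of each group."""
    best = {}
    for response in shard_responses:
        for group, (score, doc_id) in response:
            if group not in best or score > best[group][0]:
                best[group] = (score, doc_id)
    top = heapq.nlargest(k, best.items(), key=lambda item: item[1][0])
    return [(group, doc_id, score) for group, (score, doc_id) in top]

shards = [
    [("d1", "a", 0.9), ("d2", "b", 0.5), ("d3", "a", 0.4)],
    [("d4", "b", 0.8), ("d5", "c", 0.7), ("d6", "a", 0.6)],
]
merged = merge_with_doc([shard_top_groups_with_doc(s, 2) for s in shards], 2)
```

In this toy model, and ignoring score ties, the shard that holds the best
document of a global top-k group must also have that group in its own
top-k, so the merged list already contains the top docid for each group and
could be passed on to GET_FIELDS.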

Cheers,
Diego

[1] https://issues.apache.org/jira/browse/SOLR-8542
