kkrugler opened a new issue #7206:
URL: https://github.com/apache/pinot/issues/7206
Currently per-segment results from an aggregation are trimmed to `Max(5000,
<query limit> * 5)`. This can result in incorrect results when these partial
results are combined by the broker, most obviously where the ordering of groups
is opposite the value being used to do the ordering.
For example, assume a table with two fields, `studentId` and `testScore`. If
there are multiple test scores per student, and rows for the same student can
be stored in two different segments, and there are more than 5000 unique
student ids in each segment, then this query is problematic:
``` sql
SELECT studentId, min(testScore) AS lowScore
FROM testResultsTable
GROUP BY studentId
ORDER BY lowScore DESC
LIMIT 10
```
If `Student1` has a single row with a `testScore` of 0 in segment A, and
that isn't one of the highest 5000 group results from that segment (likely),
then it doesn't get returned to the broker. If `Student1` also has a single row
with `testScore` of 100 in segment B, and this is one of the top 5000 results
from this segment (likely), then it will be returned to the broker.
So now when the broker recreates the combined aggregation results for
`Student1`, it appears that the low score is 100, and it gets returned as one
of the top 10 results...which isn't correct.
In this situation, the only way to get accurate results is to return **ALL**
of the groups from each segment, so that the broker can calculate correct
results. You can force this by using a very high limit, but (a) you'd need to
know the max cardinality of of groups for any segment that participates in the
query, and (b) the broker is doing a lot of wasted work in generating a very
large result that will mostly be ignored.
@Jackie-Jiang has suggested supporting a new option (e.g.
`OPTION(noGroupTrim=true)`) which would force all groups to be returned to the
broker.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]