wirybeaver opened a new issue, #12080: URL: https://github.com/apache/pinot/issues/12080
While reading the source code of Pinot's GroupByExecutor, I noticed it lacks the following features of Druid's GroupByV2 engine:

1. Spill to disk for the merging buffer. [Druid SpillingGrouper](https://github.com/apache/druid/blob/9f3b26676d30f90599a7d55e43549617e0cee082/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/SpillingGrouper.java)
2. Parallel combine when merging sorted aggregation results. Druid builds a combining tree of threads on local historical nodes. [Druid ParallelCombiner](https://github.com/apache/druid/blob/9f3b26676d30f90599a7d55e43549617e0cee082/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/ParallelCombiner.java#L64)

```
          o              <- non-leaf node
   /   /     \   \       <- ICD = 4
  o    o      o   o      <- non-leaf nodes
 / \  / \    / \ / \     <- LCD = 2
o   o o  o  o  o o  o    <- leaf nodes
```

Reference: [Druid GroupBy Tuning Guide](https://druid.apache.org/docs/latest/querying/groupbyquery/). As the tuning guide mentions, Druid always sorts the aggregated result by default when limit pushdown is not enabled.

I have a strong feeling that integrating a disk-spill feature would allow Pinot to process data at a much larger scale and resolve the issue of non-deterministic results for GROUP BY without ORDER BY, i.e. https://github.com/apache/pinot/issues/11706. In addition, the non-leaf stages of the multi-stage (v2) engine could also adopt these two features for partitioned aggregation.

I'm raising this issue to solicit opinions from folks. If there is sufficient support, I will write a design doc for leaf-stage group-by execution.
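To make the spill-to-disk idea concrete, here is a minimal, hypothetical sketch (the class and method names are illustrative, not Pinot or Druid APIs): aggregate into a bounded in-memory map, flush a sorted run to disk whenever the map exceeds a key budget, and combine the runs with the remaining in-memory state at the end. For brevity this sketch re-merges runs through a `TreeMap` rather than a streaming k-way merge, which a real implementation would use to keep the merge itself bounded.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical spill-to-disk grouper for a SUM aggregation over String keys.
public class SpillingGrouper {
  private final int maxKeysInMemory;
  private final TreeMap<String, Long> buffer = new TreeMap<>();
  private final List<Path> spillFiles = new ArrayList<>();

  public SpillingGrouper(int maxKeysInMemory) {
    this.maxKeysInMemory = maxKeysInMemory;
  }

  public void aggregate(String key, long value) throws IOException {
    buffer.merge(key, value, Long::sum);
    if (buffer.size() >= maxKeysInMemory) {
      spill();
    }
  }

  // Write the current buffer as a sorted run (TreeMap iterates in key order),
  // then clear it so memory usage stays bounded by maxKeysInMemory.
  private void spill() throws IOException {
    Path file = Files.createTempFile("groupby-spill", ".tmp");
    try (BufferedWriter w = Files.newBufferedWriter(file)) {
      for (Map.Entry<String, Long> e : buffer.entrySet()) {
        w.write(e.getKey() + "\t" + e.getValue());
        w.newLine();
      }
    }
    spillFiles.add(file);
    buffer.clear();
  }

  // Combine all sorted runs with the in-memory buffer. A production version
  // would stream a k-way merge (e.g. a PriorityQueue over run readers)
  // instead of materializing everything back into one map.
  public SortedMap<String, Long> mergedResult() throws IOException {
    TreeMap<String, Long> result = new TreeMap<>(buffer);
    for (Path file : spillFiles) {
      try (BufferedReader r = Files.newBufferedReader(file)) {
        String line;
        while ((line = r.readLine()) != null) {
          int tab = line.indexOf('\t');
          result.merge(line.substring(0, tab),
              Long.parseLong(line.substring(tab + 1)), Long::sum);
        }
      }
    }
    return result;
  }
}
```

Because every spill is written in sorted order, the final combine is exactly the sorted-run merge that a parallel combining tree (as in Druid's ParallelCombiner) could further split across threads.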
