rishvin commented on issue #1994: URL: https://github.com/apache/datafusion-comet/issues/1994#issuecomment-3316214799
Did some initial investigation on this. Unlike Spark, DataFusion does not have a separate operator for sort aggregation. However, DataFusion's `AggregateExec` can internally mimic Spark's `SortAggregateExec` behaviour.

Found [this related blog post](https://datafusion.apache.org/blog/2025/03/11/ordering-analysis/?utm_source=chatgpt.com) from DataFusion, which details how DataFusion detects the sortedness of data and leverages it to avoid sorting. `AggregateExec` uses this mechanism to detect whether the incoming data is sorted and to take advantage of that during aggregation. If the data is sorted, the `InputOrderMode` is set to `Sorted` ([code link](https://github.com/apache/datafusion/blob/1629420162815e1a725b9be5344fafd1fbe2e5ff/datafusion/physical-plan/src/aggregates/mod.rs#L531)). The `InputOrderMode` helps derive the grouping order, [here](https://github.com/apache/datafusion/blob/1629420162815e1a725b9be5344fafd1fbe2e5ff/datafusion/physical-plan/src/aggregates/order/mod.rs#L44).

For sorted grouping data, `AggregateExec` does not wait for `EOF` before emitting; it emits a group as soon as it sees the next one. This emitting logic is handled [here](https://github.com/apache/datafusion/blob/1629420162815e1a725b9be5344fafd1fbe2e5ff/datafusion/physical-plan/src/aggregates/order/mod.rs#L56).

Based on this, the initial impression is that we already have a native mechanism in place to achieve sort aggregation. However, Comet does not serialize `SortAggregateExec`; it only serializes `HashAggregateExec`, [here](https://github.com/apache/datafusion-comet/blob/f1fb980518342e91fae8358a4de3d76b7bbf9d4d/spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala#L1286). We may be able to serialize the sort aggregate plan and leverage the existing `AggregateExec` to do all the heavy lifting.

**Next Steps**
- Do a spike to tangibly understand the actual work needed.
- If possible, do benchmarking.
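
To make the "emit once it sees the next group" behaviour concrete, here is a minimal standalone sketch of aggregation over input that is already sorted by group key. It is not DataFusion's code: `aggregate_sorted`, the `(key, value)` row shape, and the fixed `SUM` aggregate are all hypothetical and chosen only for illustration.

```rust
/// Streaming SUM over rows that are already sorted by group key.
/// Hypothetical helper for illustration only; this is not DataFusion's API.
///
/// `emit` is called for a group as soon as a row with a different key
/// arrives, so only one accumulator is held at any time.
fn aggregate_sorted<K, F>(rows: &[(K, i64)], mut emit: F)
where
    K: PartialEq + Clone,
    F: FnMut(K, i64),
{
    // Accumulator for the group currently being read, if any.
    let mut current: Option<(K, i64)> = None;

    for (key, value) in rows {
        let same_group = matches!(&current, Some((k, _)) if k == key);
        if same_group {
            // Still inside the same group: keep accumulating.
            if let Some((_, acc)) = current.as_mut() {
                *acc += *value;
            }
        } else {
            // The key changed. Because the input is sorted, the previous
            // group is complete and can be emitted before end of input.
            if let Some((done_key, done_sum)) = current.take() {
                emit(done_key, done_sum);
            }
            current = Some((key.clone(), *value));
        }
    }

    // End of input: flush the last group.
    if let Some((last_key, last_sum)) = current {
        emit(last_key, last_sum);
    }
}
```

With rows `("a", 1), ("a", 2), ("b", 5)`, the sketch emits `("a", 3)` when it reaches the first `"b"` row, not at end of input. A hash-based aggregation over unsorted input would have to keep every group's state until the input is exhausted.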
