andygrove opened a new pull request, #4010:
URL: https://github.com/apache/datafusion-comet/pull/4010
## Which issue does this PR close?
Closes #4004.
Note: stacked on top of #3989; review after that merges.
## Rationale for this change
In TPC-DS plans such as q1, when partial/final hash aggregates cannot be
converted to Comet (e.g. due to unsupported aggregate expressions or types),
the shuffle between them is still wrapped as ``CometColumnarExchange``
(columnar shuffle). With a JVM operator on both sides of the shuffle, this adds
a row -> arrow -> shuffle -> arrow -> row round trip with no Comet operator on
either side able to consume columnar output:
```
HashAggregate
+- CometNativeColumnarToRow
+- CometColumnarExchange
+- HashAggregate
```
The extra conversion is pure overhead compared to a vanilla Spark row-based
shuffle.
## What changes are included in this PR?
- Added a post-transform pass ``revertRedundantColumnarShuffle`` in
``CometExecRule`` that detects ``CometShuffleExchangeExec`` with
``CometColumnarShuffle`` whose child is not a Comet plan and whose parent is
not a Comet plan, and reverts it to the original Spark ``ShuffleExchangeExec``
(preserved in the ``originalPlan`` field).
- The pass runs only in the ``COMET_EXEC_ENABLED=true`` branch of
``_apply``, so users running with ``COMET_EXEC_ENABLED=false`` (shuffle-only
mode) are unaffected.
- Regenerated TPC-DS plan-stability golden files for Spark 3.4, 3.5, and 4.0
to reflect the new pattern.
After the fix, the q1 pattern above becomes:
```
HashAggregate
+- Exchange
+- HashAggregate
```
## How are these changes tested?
- New test ``CometExecRule should not wrap shuffle in CometColumnarShuffle
when both sides are JVM`` in ``CometExecRuleSuite`` disables partial hash
aggregate to force both aggregates to stay JVM, then asserts that the shuffle
remains a plain ``ShuffleExchangeExec``.
- Existing ``CometShuffleSuite``, ``CometExecSuite``,
``CometShuffleFallbackStickinessSuite``, and both TPC-DS plan-stability suites
(v1.4 and v2.7) continue to pass against the regenerated golden files.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]