karuppayya commented on code in PR #4010:
URL: https://github.com/apache/datafusion-comet/pull/4010#discussion_r3119024954
##########
spark/src/main/scala/org/apache/comet/rules/CometExecRule.scala:
##########
@@ -97,6 +97,32 @@ case class CometExecRule(session: SparkSession) extends
Rule[SparkPlan] {
private lazy val showTransformations =
CometConf.COMET_EXPLAIN_TRANSFORMATIONS.get()
+ /**
+ * Revert any `CometShuffleExchangeExec` with `CometColumnarShuffle` that is
sandwiched between
+ * two non-Comet operators back to the original Spark `ShuffleExchangeExec`.
Columnar shuffle
+ * converts row-based input to Arrow batches for the shuffle read side; if
neither the parent
+ * nor the child is a Comet plan that can consume columnar output, that
conversion is pure
+ * overhead (row->arrow->shuffle->arrow->row vs. row->shuffle->row).
+ */
+ private def revertRedundantColumnarShuffle(plan: SparkPlan): SparkPlan = {
Review Comment:
I agree that this optimization will improve performance and compute
efficiency.
My main concern is determining the best recommendation for users to tune
memory, particularly since they cannot explicitly disable it.
Also can it be a seperate rule in itself and have it only in
`org.apache.comet.CometSparkSessionExtensions.CometExecColumnar#postColumnarTransitions`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]