viirya commented on a change in pull request #34642:
URL: https://github.com/apache/spark/pull/34642#discussion_r766931441
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala
##########
@@ -71,6 +71,11 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable
val id: Int = SparkPlan.newPlanId()
+ /**
+ * Return true if this stage of the plan supports row-based execution.
Review comment:
> Hmm, any example? Every physical plan node has to implement doExecute, which outputs rows, while doExecuteColumnar throws an exception by default.
I think there is no guarantee that a physical node must implement a working
`doExecute`. A columnar node can simply throw an exception saying row-based
execution is not implemented (like the default `doExecuteColumnar` does) if it
is not designed to run under row-based execution.
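For illustration, here is a minimal sketch of such a columnar-only node (the name `MyColumnarScanExec` and its body are hypothetical, not from this PR or from Spark itself):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.execution.LeafExecNode
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical columnar-only operator: it implements doExecuteColumnar,
// while doExecute throws, mirroring (in reverse) the default
// doExecuteColumnar behavior described above.
case class MyColumnarScanExec(output: Seq[Attribute]) extends LeafExecNode {

  override def supportsColumnar: Boolean = true

  // Row-based execution is deliberately unsupported.
  override protected def doExecute(): RDD[InternalRow] =
    throw new UnsupportedOperationException(
      s"$nodeName does not support row-based execution")

  // Placeholder body for the sketch; a real node would produce batches.
  override protected def doExecuteColumnar(): RDD[ColumnarBatch] =
    sparkContext.emptyRDD[ColumnarBatch]
}
```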
In general, I also don't see a need to implement both working row-based and
columnar execution for a node. But because we don't actually have official
columnar execution nodes in Spark, I cannot point to an example from Spark
itself. Hopefully I've conveyed the idea clearly.
> prefersColumnar: this plan prefers to output columnar batches even if it is not explicitly requested (e.g., outputsColumnar is false).
BTW, I don't think `outputsColumnar` is a preference option (at least not in
its current usage in the rule). It actually means whether the output should be
columnar or not. Once `outputsColumnar` is false, the plan should produce
row-based output, which is why we add `ColumnarToRowExec` in that case.
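To make that concrete, here is a much-simplified sketch of the transition (not the actual `ApplyColumnarRulesAndInsertTransitions` rule, which also tracks the requested output format through the plan):

```scala
import org.apache.spark.sql.execution.{ColumnarToRowExec, SparkPlan}

// Simplified sketch: when row-based output is required but the operator
// produces columnar batches, wrap it so downstream operators see rows.
def ensureRowBasedOutput(plan: SparkPlan): SparkPlan =
  if (plan.supportsColumnar) ColumnarToRowExec(plan) else plan
```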
Yea, the preference I mentioned is pretty limited so far. I agree that we may
need a preference rule (or something like it) in the future. Since we don't
have real built-in columnar operators in Spark, the current situation is that
some columnar extensions/libraries replace row-based operators with columnar
operators during planning, as sketched below. I'm not sure we can estimate
which one is preferred during planning.
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala
##########
@@ -71,6 +71,11 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable
val id: Int = SparkPlan.newPlanId()
+ /**
+ * Return true if this stage of the plan supports row-based execution.
Review comment:
BTW, IMHO, if we add columnar support for more operators in the future, I
guess that already implicitly indicates we "prefer" it over the current
execution (whole-stage codegen or the interpreted one)? Just like whole-stage
codegen, it seems we simply prefer it once we verify that it generally
performs better. This is similar to the third-party extensions/libraries
situation, I think.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]