HyukjinKwon opened a new pull request #35058:
URL: https://github.com/apache/spark/pull/35058


   ### What changes were proposed in this pull request?
   
   This PR proposes to simply ignore exceptions caused by transient fields during the `supportsColumnar` sanity check in `ColumnarToRowExec`.
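   
   A minimal sketch of the idea (illustrative only; `PlanLike` and `ColumnarToRowSketch` below are hypothetical placeholders, and the actual change in `Columnar.scala` may differ):
   
   ```scala
   // Illustrative sketch only: run the sanity check, but do not let it fail
   // when the child plan's transient state is absent (e.g. after it was
   // deserialized on an executor) and the check throws as a result.
   trait PlanLike { def supportsColumnar: Boolean }   // hypothetical stand-in for SparkPlan
   
   case class ColumnarToRowSketch(child: PlanLike) {
     try {
       assert(child.supportsColumnar)                 // the sanity check
     } catch {
       case _: NullPointerException =>                // transient field was null; ignore
     }
   }
   ```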
   
   ### Why are the changes needed?
   
   SPARK-23731 made the plans serializable by leveraging lazy vals, but SPARK-28213 happened to refer to the lazy variable at:
https://github.com/apache/spark/blob/77b164aac9764049a4820064421ef82ec0bc14fb/sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala#L68
   
   This can fail during canonicalization, for example while eliminating common subexpressions:
   
   ```
   java.lang.NullPointerException
        at org.apache.spark.sql.execution.FileSourceScanExec.supportsColumnar$lzycompute(DataSourceScanExec.scala:280)
        at org.apache.spark.sql.execution.FileSourceScanExec.supportsColumnar(DataSourceScanExec.scala:279)
        at org.apache.spark.sql.execution.InputAdapter.supportsColumnar(WholeStageCodegenExec.scala:509)
        at org.apache.spark.sql.execution.ColumnarToRowExec.<init>(Columnar.scala:67)
        ...
        at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:581)
        at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:580)
        at org.apache.spark.sql.execution.ScalarSubquery.canonicalized$lzycompute(subquery.scala:110)
        ...
        at org.apache.spark.sql.catalyst.expressions.ExpressionEquals.hashCode(EquivalentExpressions.scala:275)
        ...
        at scala.collection.mutable.HashTable.findEntry$(HashTable.scala:135)
        at scala.collection.mutable.HashMap.findEntry(HashMap.scala:44)
        at scala.collection.mutable.HashMap.get(HashMap.scala:74)
        at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExpr(EquivalentExpressions.scala:46)
        at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTreeHelper$1(EquivalentExpressions.scala:147)
        at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTree(EquivalentExpressions.scala:170)
        at org.apache.spark.sql.catalyst.expressions.SubExprEvaluationRuntime.$anonfun$proxyExpressions$1(SubExprEvaluationRuntime.scala:89)
        at org.apache.spark.sql.catalyst.expressions.SubExprEvaluationRuntime.$anonfun$proxyExpressions$1$adapted(SubExprEvaluationRuntime.scala:89)
        at scala.collection.immutable.List.foreach(List.scala:392)
   ```
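   
   For context, the failure mechanism can be shown in isolation with a hypothetical sketch (the classes below are placeholders, not Spark code): a lazy val that dereferences a `@transient` field throws a `NullPointerException` once the object has been deserialized, because the transient field comes back as `null`.
   
   ```scala
   import java.io._
   
   // Hypothetical reproduction of the mechanism; not Spark code.
   case class Relation(files: Seq[String])
   
   case class ScanSketch(@transient relation: Relation) {
     // Serialization-friendly lazy val (in the spirit of SPARK-23731), but it
     // touches the transient field, so forcing it after deserialization throws,
     // which is what the sanity check referenced above ends up doing.
     lazy val supportsColumnar: Boolean = relation.files.nonEmpty
   }
   
   object Repro extends App {
     val bytes = new ByteArrayOutputStream()
     new ObjectOutputStream(bytes).writeObject(ScanSketch(Relation(Seq("f"))))
     val copy = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
       .readObject().asInstanceOf[ScanSketch]
     copy.supportsColumnar  // throws NullPointerException: `relation` is null here
   }
   ```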
   
   This is still a band-aid fix, but it at least addresses the issue with a minimal change. It should ideally be backported as well.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Pretty unlikely; it only matters when `ColumnarToRowExec` has to be canonicalized on the executor side (see the stack trace above). But yes, it would fix a bug.
   
   ### How was this patch tested?
   
   A unit test was added.
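   
   For illustration, a test of the guarded behaviour could take roughly this shape, reusing the placeholder classes from the sketch above (the actual test added in this PR most likely exercises a real plan instead):
   
   ```scala
   import org.scalatest.funsuite.AnyFunSuite
   
   // Sketch only: uses the hypothetical PlanLike/ColumnarToRowSketch classes above.
   class ColumnarToRowSketchSuite extends AnyFunSuite {
     test("construction does not fail when the child's sanity check NPEs") {
       // A child whose check throws, mimicking a plan deserialized without its
       // transient state.
       val brokenChild = new PlanLike {
         def supportsColumnar: Boolean = throw new NullPointerException
       }
       // With the guard in place, this must not throw.
       ColumnarToRowSketch(brokenChild)
     }
   }
   ```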


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


