cloud-fan commented on code in PR #38558:
URL: https://github.com/apache/spark/pull/38558#discussion_r1018732458
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala:
##########
@@ -111,10 +112,15 @@ case class InMemoryTableScanExec(
   override def output: Seq[Attribute] = attributes

+  private def cachedPlan = relation.cachedPlan match {
+    case adaptive: AdaptiveSparkPlanExec if adaptive.isFinalized =>
+      adaptive.executedPlan
+    case other => other
+  }
Review Comment:
Another idea is to materialize the AQE plan eagerly so that even the first
cache access can be optimized. However, that would require triggering query
execution during query planning, which is a bit risky.
A good practice is to ask users to populate the cache eagerly, e.g. run a
`df.count` right after `df` is cached. Then they won't observe
inconsistencies. In any case, I think this PR is a net win, as it optimizes all
subsequent cache accesses after the first one. This matters for query
caching, since a cache is meant to be accessed repeatedly.
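
The eager-caching practice suggested above can be sketched as follows. This is a minimal illustration, not code from the PR; the app name and the example DataFrame are made up for the sketch:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: populate the cache eagerly so that by the time the cached plan is
// read again, the AQE plan backing it has already been finalized.
val spark = SparkSession.builder().appName("eager-cache-sketch").getOrCreate()
val df = spark.range(0, 1000000).selectExpr("id", "id % 10 AS bucket")

df.cache()  // marks the plan for caching; nothing is computed yet (lazy)
df.count()  // eager action: materializes the cache, finalizing the AQE plan

// Later accesses hit the already-materialized (AQE-optimized) cached plan.
df.groupBy("bucket").count().show()
```

The `df.count()` call is the key step: `cache()` alone is lazy, so without an eager action the first consumer of the cache pays the materialization cost and, before this PR, also read a non-finalized plan.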
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]