GitHub user hvanhovell commented on a diff in the pull request:
https://github.com/apache/spark/pull/20394#discussion_r163943771
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
---
@@ -73,11 +73,16 @@ case class InMemoryRelation(
   @transient val partitionStatistics = new PartitionStatistics(output)
   override def computeStats(): Statistics = {
-    if (batchStats.value == 0L) {
-      // Underlying columnar RDD hasn't been materialized, use the stats from the plan to cache
-      statsOfPlanToCache
+    if (sizeInBytesStats.value == 0L) {
+      // Underlying columnar RDD hasn't been materialized, use the stats from the plan to cache.
+      // Note that we should drop the hint info here. We may cache a plan whose root node is a hint
+      // node. When we lookup the cache with a semantically same plan without hint info, the plan
+      // returned by cache lookup should not have hint info. If we lookup the cache with a
+      // semantically same plan with a different hint info, `CacheManager.useCachedData` will take
+      // care of it and retain the hint info in the lookup input plan.
+      statsOfPlanToCache.copy(hints = HintInfo())
--- End diff --
I am not sure I agree with this. If we cache a plan with a hint, then it is
reasonable to expect that the hint is still in the plan. We do the same with
temporary views.
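
To make the two positions concrete, here is a minimal sketch of the difference between dropping and keeping the hint on the cached plan's statistics. It does not use Spark's real classes: `HintInfo` and `Statistics` below are simplified stand-ins with most fields trimmed, and `CachedPlanHints`, `dropHints`, and `keepHints` are hypothetical names introduced only for illustration.

```scala
// Simplified stand-ins for Spark's HintInfo and Statistics (assumption:
// the real classes carry more fields; only what the diff touches is modeled).
final case class HintInfo(broadcast: Boolean = false)
final case class Statistics(sizeInBytes: BigInt, hints: HintInfo = HintInfo())

object CachedPlanHints {
  // Stats of a hypothetical plan that was cached with a broadcast hint at its root.
  val statsOfPlanToCache: Statistics =
    Statistics(sizeInBytes = BigInt(1024), hints = HintInfo(broadcast = true))

  // Behavior proposed in the diff: reset the hints to their default value.
  def dropHints(stats: Statistics): Statistics = stats.copy(hints = HintInfo())

  // Behavior argued for in this comment: hand the stats back unchanged,
  // so the hint the user cached the plan with is still visible.
  def keepHints(stats: Statistics): Statistics = stats

  def main(args: Array[String]): Unit = {
    println(dropHints(statsOfPlanToCache).hints.broadcast) // false
    println(keepHints(statsOfPlanToCache).hints.broadcast) // true
  }
}
```

In both variants the size estimate is untouched; the only observable difference is whether the broadcast hint survives the cache lookup.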
---