[GitHub] spark pull request #20394: [SPARK-23214][SQL] cached data should not carry e...

hvanhovell Thu, 25 Jan 2018 11:27:17 -0800

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20394#discussion_r163943771
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala ---
    @@ -73,11 +73,16 @@ case class InMemoryRelation(
       @transient val partitionStatistics = new PartitionStatistics(output)
     
       override def computeStats(): Statistics = {
    -    if (batchStats.value == 0L) {
    -      // Underlying columnar RDD hasn't been materialized, use the stats from the plan to cache
    -      statsOfPlanToCache
    +    if (sizeInBytesStats.value == 0L) {
    +      // Underlying columnar RDD hasn't been materialized, use the stats from the plan to cache.
    +      // Note that we should drop the hint info here. We may cache a plan whose root node is a hint
    +      // node. When we lookup the cache with a semantically same plan without hint info, the plan
    +      // returned by cache lookup should not have hint info. If we lookup the cache with a
    +      // semantically same plan with a different hint info, `CacheManager.useCachedData` will take
    +      // care of it and retain the hint info in the lookup input plan.
    +      statsOfPlanToCache.copy(hints = HintInfo())
    --- End diff --
    
    I am not sure I agree with this. If we cache a plan with a hint, then it is
    reasonable to expect that the hint is still in the plan. We do the same with
    temporary views.
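
For readers following the thread, here is a minimal sketch of what the `copy(hints = HintInfo())` call in the diff does to the returned statistics. It is illustrative only: `HintDropSketch`, its two case classes, and the `materializedBytes` parameter are simplified, hypothetical stand-ins rather than Spark's actual definitions, and the `else` branch is an assumption because the diff does not show it.

```scala
object HintDropSketch {
  // Simplified, hypothetical stand-ins for Spark's HintInfo and Statistics.
  final case class HintInfo(broadcast: Boolean = false)
  final case class Statistics(sizeInBytes: BigInt, hints: HintInfo = HintInfo())

  // Same shape as the branch in the diff: while nothing has been materialized
  // (accumulated size is 0), reuse the stats of the plan to cache but reset
  // the hints to their default. The else branch is assumed, not from the diff.
  def computeStats(materializedBytes: Long, statsOfPlanToCache: Statistics): Statistics =
    if (materializedBytes == 0L) statsOfPlanToCache.copy(hints = HintInfo())
    else Statistics(sizeInBytes = BigInt(materializedBytes))

  def main(args: Array[String]): Unit = {
    // A plan cached with a broadcast hint at its root.
    val hinted = Statistics(sizeInBytes = BigInt(1024), hints = HintInfo(broadcast = true))
    val fromCache = computeStats(0L, hinted)
    println(fromCache.hints.broadcast) // false: the hint is gone
    println(fromCache.sizeInBytes)     // 1024: the size estimate is kept
  }
}
```

In this sketch the size estimate survives while the broadcast hint does not. That loss of the hint is the behavior the comment above questions.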


