Github user CodingCat commented on a diff in the pull request:
https://github.com/apache/spark/pull/19864#discussion_r156716896
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala ---
@@ -71,9 +74,10 @@ case class InMemoryRelation(
   override def computeStats(): Statistics = {
     if (batchStats.value == 0L) {
-      // Underlying columnar RDD hasn't been materialized, no useful statistics information
-      // available, return the default statistics.
-      Statistics(sizeInBytes = child.sqlContext.conf.defaultSizeInBytes)
+      // Underlying columnar RDD hasn't been materialized, use the stats from the plan to cache when
+      // applicable
+      statsOfPlanToCache.getOrElse(Statistics(sizeInBytes =
+        child.sqlContext.conf.defaultSizeInBytes))
--- End diff ---
I just followed the original implementation... I don't have a better value to put here. If we use sizeInBytes, the risk is with data sources like parquet-formatted files, where sizeInBytes is much smaller than the actual size in memory. Do you have a suggestion?
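
For illustration only, here is a minimal sketch of the trade-off being discussed (the standalone Statistics case class and the expansionFactor parameter are hypothetical, not part of this PR): if the stats of the plan to cache come from file sizes, one could inflate them before falling back to the conf default.

// Hypothetical sketch, NOT the PR's code: a fallback that compensates
// for on-disk sizes (e.g. compressed Parquet) being much smaller than
// the in-memory columnar representation.
case class Statistics(sizeInBytes: BigInt)

def computeStats(
    batchStatsValue: Long,                   // bytes accumulated after materialization
    statsOfPlanToCache: Option[Statistics],  // stats of the plan being cached
    defaultSizeInBytes: Long,                // conf default fallback
    expansionFactor: Double = 4.0): Statistics = {  // assumed disk-to-memory blow-up, hypothetical
  if (batchStatsValue == 0L) {
    // Not materialized yet: use the cached plan's stats, inflated to
    // approximate the in-memory size, or fall back to the default.
    statsOfPlanToCache
      .map { s =>
        Statistics(sizeInBytes =
          (BigDecimal(s.sizeInBytes) * BigDecimal(expansionFactor)).toBigInt)
      }
      .getOrElse(Statistics(sizeInBytes = defaultSizeInBytes))
  } else {
    // Materialized: the accumulator already holds the actual in-memory size.
    Statistics(sizeInBytes = batchStatsValue)
  }
}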
---