Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19864#discussion_r156847927
  
    --- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
    @@ -80,6 +80,14 @@ class CacheManager extends Logging {
         cachedData.isEmpty
       }
     
    +  private def extractStatsOfPlanForCache(plan: LogicalPlan): 
Option[Statistics] = {
    +    if (plan.stats.rowCount.isDefined) {
    --- End diff ---
    
    The expected value should be `rowCount * avgRowSize`. Without CBO, I think the file size is the best estimate we can get, although it may not be accurate.
    
    That is to say, without CBO a Parquet relation may have an underestimated size and cause OOM, and users need to turn on CBO to fix it. The same thing would happen with the table cache.
    
    We could fix this by defining `sizeInBytes` as `file size * some factor`, but that belongs in another PR.
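
    To make the estimation concrete, here is a minimal sketch of the logic being discussed. The `Statistics` case class and the `avgRowSize` field below are simplified, hypothetical stand-ins (Spark's real `Statistics` class is more involved); the sketch only illustrates the fallback: use `rowCount * avgRowSize` when CBO stats are available, otherwise fall back to the possibly underestimated file-based `sizeInBytes`.

    ```scala
    // Hypothetical, simplified stand-in for Spark's Statistics, for illustration only.
    case class Statistics(
        sizeInBytes: BigInt,          // file-based size estimate (may underestimate in-memory size)
        rowCount: Option[BigInt],     // populated when CBO is enabled
        avgRowSize: Option[Long])     // hypothetical field: average row size in bytes

    // With CBO stats, estimate rowCount * avgRowSize; otherwise fall back
    // to the file-based sizeInBytes.
    def estimatedCacheSize(stats: Statistics): BigInt =
      (stats.rowCount, stats.avgRowSize) match {
        case (Some(rows), Some(avg)) => rows * avg
        case _                       => stats.sizeInBytes
      }
    ```

    With this shape, turning on CBO (so `rowCount` is defined) switches the cache to the row-based estimate, which is the behavior the comment argues for.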
