Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19864#discussion_r156847927

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
    @@ -80,6 +80,14 @@ class CacheManager extends Logging {
         cachedData.isEmpty
       }

    +  private def extractStatsOfPlanForCache(plan: LogicalPlan): Option[Statistics] = {
    +    if (plan.stats.rowCount.isDefined) {
    --- End diff --

    The expected value should be `rowCount * avgRowSize`. Without CBO, I think the file size is the best we can get, although it may not be correct. That is to say, without CBO, a parquet relation may have an underestimated size and cause OOM; users need to turn on CBO to fix it. So the same thing should happen in table cache. We could also fix this by defining `sizeInBytes` as `file size * some factor`, but that belongs in another PR.
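The sizing policy described above (prefer `rowCount * avgRowSize` when CBO statistics are available, otherwise fall back to the raw file size) can be sketched in plain Scala. Note this is only an illustration: `PlanStats` and `expectedCacheSize` are hypothetical stand-ins, not Spark's actual `Statistics` API.

```scala
// Hypothetical, simplified stand-in for a logical plan's statistics.
case class PlanStats(
    sizeInBytes: BigInt,              // raw on-disk size, e.g. parquet file size
    rowCount: Option[BigInt] = None,  // populated only when CBO is enabled
    avgRowSize: Option[Long] = None)  // average in-memory row width in bytes

// Prefer the CBO-based estimate rowCount * avgRowSize; without CBO the
// file size is the best available, even though it may underestimate the
// in-memory footprint (e.g. for compressed parquet data).
def expectedCacheSize(stats: PlanStats): BigInt =
  (stats.rowCount, stats.avgRowSize) match {
    case (Some(rows), Some(width)) => rows * BigInt(width)
    case _                         => stats.sizeInBytes
  }

// With CBO stats: 1000 rows * 64 bytes per row.
println(expectedCacheSize(
  PlanStats(sizeInBytes = 8000, rowCount = Some(1000), avgRowSize = Some(64))))
// Without CBO stats: falls back to the on-disk file size.
println(expectedCacheSize(PlanStats(sizeInBytes = 8000)))
```

The fallback is why an underestimated file size (e.g. heavily compressed parquet) can lead to an overly optimistic cache-size estimate and, ultimately, OOM.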