Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19864#discussion_r156847927
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -80,6 +80,14 @@ class CacheManager extends Logging {
cachedData.isEmpty
}
+ private def extractStatsOfPlanForCache(plan: LogicalPlan):
Option[Statistics] = {
+ if (plan.stats.rowCount.isDefined) {
--- End diff ---
The expected value should be `rowCount * avgRowSize`. Without CBO, I think
the file size is the best we can get, although it may not be accurate.
That is to say, without CBO a parquet relation may have an underestimated
size and cause OOM; users need to turn on CBO to fix it. The same thing
should therefore happen for the table cache.
We could fix this by defining `sizeInBytes` as `file size * some factor`,
but that belongs in another PR.
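As a rough illustration of the idea above (this is a hypothetical sketch, not Spark's actual `CacheManager` code; `Stats` and `estimatedCacheSize` are made-up names), a cache manager could prefer the CBO-derived `rowCount * avgRowSize` estimate when row-count statistics exist, and only fall back to the raw file size otherwise:

```scala
// Hypothetical sketch, not Spark's real API: choose a size estimate for
// caching, preferring CBO row-count stats over the on-disk file size.
case class Stats(rowCount: Option[BigInt], sizeInBytes: BigInt, avgRowSize: Long)

def estimatedCacheSize(stats: Stats): BigInt =
  stats.rowCount match {
    // With CBO: rowCount * avgRowSize better reflects the in-memory footprint.
    case Some(rows) => rows * BigInt(stats.avgRowSize)
    // Without CBO: fall back to file size, which may underestimate memory
    // usage (e.g. for compressed columnar formats like Parquet).
    case None => stats.sizeInBytes
  }
```

With CBO stats present, `estimatedCacheSize(Stats(Some(100), 1000, 16))` yields 1600 rather than the 1000-byte file size, which is the behavior the comment argues for.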
---