Github user CodingCat commented on a diff in the pull request:
https://github.com/apache/spark/pull/19864#discussion_r155020000
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -94,14 +94,16 @@ class CacheManager extends Logging {
logWarning("Asked to cache already cached data.")
} else {
val sparkSession = query.sparkSession
- cachedData.add(CachedData(
- planToCache,
- InMemoryRelation(
- sparkSession.sessionState.conf.useCompression,
- sparkSession.sessionState.conf.columnBatchSize,
- storageLevel,
- sparkSession.sessionState.executePlan(planToCache).executedPlan,
- tableName)))
+ val inMemoryRelation = InMemoryRelation(
+ sparkSession.sessionState.conf.useCompression,
+ sparkSession.sessionState.conf.columnBatchSize,
+ storageLevel,
+ sparkSession.sessionState.executePlan(planToCache).executedPlan,
+ tableName)
+ if (planToCache.conf.cboEnabled &&
+ planToCache.stats.rowCount.isDefined) {
--- End diff --
The reason I put the check here is that when CBO is not enabled, the stats of
the underlying plan might be much smaller than the actual in-memory size,
creating a potential risk of OOM errors.
The underlying cause is that without CBO, the plan's size is calculated from
BaseRelation's sizeInBytes, but with CBO enabled we can get a more accurate
estimation based on the row count:
https://github.com/apache/spark/blob/03fdc92e42d260a2b7c0090115f162ba5c091aae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala#L42-L46
https://github.com/apache/spark/blob/03fdc92e42d260a2b7c0090115f162ba5c091aae/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L370-L381
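To illustrate the gap, here is a minimal sketch of the idea (this is not the
actual Spark code; `Stats`, `estimateCacheSize`, and the byte figures are
made-up assumptions purely for illustration):

```scala
// Hypothetical sketch: preferring a CBO row-count-based estimate over
// BaseRelation.sizeInBytes when deciding how big a cached plan will be.
case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt])

def estimateCacheSize(cboEnabled: Boolean, stats: Stats, avgRowBytes: Long): BigInt =
  if (cboEnabled && stats.rowCount.isDefined) {
    // With CBO, the row count is known, so the in-memory size can be
    // derived from it directly.
    stats.rowCount.get * avgRowBytes
  } else {
    // Without CBO we can only fall back to the source relation's on-disk
    // size, which may badly underestimate the decompressed in-memory
    // footprint and thus risk an OOM when caching.
    stats.sizeInBytes
  }

// Example: a compressed source reports 1 MB on disk, but 100,000 rows at
// ~100 bytes each actually occupy ~10 MB in memory.
val s = Stats(sizeInBytes = BigInt(1L << 20), rowCount = Some(BigInt(100000)))
println(estimateCacheSize(cboEnabled = true, s, avgRowBytes = 100))  // 10000000
println(estimateCacheSize(cboEnabled = false, s, avgRowBytes = 100)) // 1048576
```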
---