Github user CodingCat commented on a diff in the pull request:
https://github.com/apache/spark/pull/19864#discussion_r156716896
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala ---
@@ -71,9 +74,10 @@ case class InMemoryRelation(
   override def computeStats(): Statistics = {
     if (batchStats.value == 0L) {
-      // Underlying columnar RDD hasn't been materialized, no useful statistics information
-      // available, return the default statistics.
-      Statistics(sizeInBytes = child.sqlContext.conf.defaultSizeInBytes)
+      // Underlying columnar RDD hasn't been materialized, use the stats from the plan to cache when
+      // applicable
+      statsOfPlanToCache.getOrElse(Statistics(sizeInBytes =
+        child.sqlContext.conf.defaultSizeInBytes))
--- End diff ---
I just followed the original implementation... I don't have a better value to put here. If we use sizeInBytes, the risk is with data sources like parquet-formatted files, where sizeInBytes is much smaller than the actual size in memory. Do you have a suggestion?
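
For illustration only, here is a minimal sketch of the trade-off being discussed (the standalone Statistics case class and the expansionFactor parameter are hypothetical, not part of this PR): if the stats of the plan to cache come from file sizes, one could inflate them before falling back to the conf default.

// Hypothetical sketch, NOT the PR's code: a fallback that compensates
// for on-disk sizes (e.g. compressed Parquet) being much smaller than
// the in-memory columnar representation.
case class Statistics(sizeInBytes: BigInt)

def computeStats(
    batchStatsValue: Long,                   // bytes accumulated after materialization
    statsOfPlanToCache: Option[Statistics],  // stats of the plan being cached
    defaultSizeInBytes: Long,                // conf default fallback
    expansionFactor: Double = 4.0): Statistics = {  // assumed disk-to-memory blow-up, hypothetical
  if (batchStatsValue == 0L) {
    // Not materialized yet: use the cached plan's stats, inflated to
    // approximate the in-memory size, or fall back to the default.
    statsOfPlanToCache
      .map { s =>
        Statistics(sizeInBytes =
          (BigDecimal(s.sizeInBytes) * BigDecimal(expansionFactor)).toBigInt)
      }
      .getOrElse(Statistics(sizeInBytes = defaultSizeInBytes))
  } else {
    // Materialized: the accumulator already holds the actual in-memory size.
    Statistics(sizeInBytes = batchStatsValue)
  }
}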
---