Github user CodingCat commented on a diff in the pull request:
https://github.com/apache/spark/pull/20072#discussion_r160076999
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -263,6 +263,17 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
+  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
+    "spark.sql.sources.compressionFactor")
+    .internal()
+    .doc("The result of multiplying this factor with the size of data source files is propagated " +
+      "to serve as the stats to choose the best execution plan. In the case where the " +
+      "in-disk and in-memory size of data is significantly different, users can adjust this " +
+      "factor for a better choice of the execution plan. The default value is 1.0.")
+    .doubleConf
+    .checkValue(_ > 0, "the value of fileDataSizeFactor must be larger than 0")
--- End diff ---
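For context, a rough sketch of how a user could set and observe this factor (illustrative only, not code from this PR; the value 3.0 and the input path are made-up, and the config key is taken from the diff above):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch: override the factor when the on-disk size of a data
// source differs a lot from its in-memory size, so the optimizer sees a more
// realistic size estimate.
val spark = SparkSession
  .builder()
  .appName("compression-factor-sketch")
  .master("local[*]")
  .config("spark.sql.sources.compressionFactor", "3.0")  // made-up value
  .getOrCreate()

val df = spark.read.parquet("/tmp/tpcds/store_sales")  // hypothetical path

// The estimated size of the file-based relation feeds into plan choices
// such as whether a join side is small enough to broadcast.
println(df.queryExecution.optimizedPlan.stats.sizeInBytes)
```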
It's not necessarily the case that the Parquet size is always smaller than the in-memory size. For example, with a simple dataset (like the one used in the test), Parquet's format overhead makes the on-disk size larger than the in-memory size, but with the TPCDS dataset I observed that the Parquet size is much smaller than the in-memory size.
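To make that trade-off concrete, a rough sketch of the estimate described in the doc string (the helper `estimatedSizeInBytes` is hypothetical; the actual change applies the factor when the data source file size is turned into stats):

```scala
// Hypothetical helper illustrating the doc string:
// estimated stats size = total size of data source files * factor.
def estimatedSizeInBytes(totalFileBytes: Long, compressionFactor: Double): Long =
  (totalFileBytes * compressionFactor).toLong

// TPCDS-like case: Parquet is much smaller on disk, so a factor > 1 gives the
// optimizer a larger, more realistic in-memory estimate.
estimatedSizeInBytes(100L * 1024 * 1024, 4.0)  // 104857600 * 4.0 = 419430400

// Tiny-dataset case: Parquet's overhead dominates, the on-disk size can exceed
// the in-memory size, and a factor < 1 is reasonable (hence checkValue(_ > 0)
// rather than requiring the factor to be >= 1).
estimatedSizeInBytes(4096L, 0.5)               // 2048
```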
---