Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r160076999
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -263,6 +263,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
    +    "spark.sql.sources.compressionFactor")
    +    .internal()
    +    .doc("The result of multiplying this factor with the size of data source files is propagated " +
    +      "to serve as the stats to choose the best execution plan. In the case where the " +
    +      "in-disk and in-memory size of data is significantly different, users can adjust this " +
    +      "factor for a better choice of the execution plan. The default value is 1.0.")
    +    .doubleConf
    +    .checkValue(_ > 0, "the value of fileDataSizeFactor must be larger than 0")
    --- End diff --
    
    it's not necessarily the case that Parquet is always smaller than the in-memory size... e.g. for some simple datasets (like the one used in the test), Parquet's overhead makes the overall size larger than the in-memory size.
    
    but with the TPC-DS dataset, I observed that the Parquet size is much smaller than the in-memory size.
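    
    As a minimal sketch, assuming the config key proposed in this diff and a purely illustrative factor of 5.0, this is how a user could raise the factor when the in-memory data is known to be much larger than the on-disk Parquet files:
    
        import org.apache.spark.sql.SparkSession
        
        object CompressionFactorSketch {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("compression-factor-sketch")
              .master("local[*]")
              // Illustrative value: scale the file-based size estimate up when the
              // in-memory representation is expected to be ~5x the Parquet size.
              .config("spark.sql.sources.compressionFactor", "5.0")
              .getOrCreate()
        
            // The factor can be read back through the normal runtime conf.
            println(spark.conf.get("spark.sql.sources.compressionFactor"))
        
            spark.stop()
          }
        }
    
    The scaled size feeds the optimizer's stats for file-based relations, so it can change plan choices such as whether a side of a join is small enough to broadcast.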

