attilapiros opened a new pull request #27453: [SPARK-24914][SQL] Introduce new 
statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453
 
 
   ### What changes were proposed in this pull request?
   
   This PR introduces a new statistic called `deserFactor` which can be set 
manualy as a table property 'spark.deserFactor' and intended to be used for 
columnar file formats as a ratio of actual data size (raw data size) to file 
size to scale up the file size to improve the estimate of in-memory data size 
and having a better query optimization (i.e., join strategy decision).
   
   ### Why are the changes needed?
   
   Before this change Spark estimated the table size as the sum of all the file 
sizes. This estimate can be way too low at columnar file formats where huge 
data can be compressed into a very small file because of serialization (like 
dictionary encoding) and compression. 
   With the `deserFactor` OOM error raised as a result of a wrongly chosen 
broadcast join strategy can be avoided.
   
   ### Does this PR introduce any user-facing change?
   
   No
   
   ### How was this patch tested?
   
   The StatisticsSuite is extended with a new test: `SPARK-24914 - test 
deserialization factor`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to