attilapiros opened a new pull request #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1) URL: https://github.com/apache/spark/pull/27453 ### What changes were proposed in this pull request? This PR introduces a new statistic called `deserFactor` which can be set manualy as a table property 'spark.deserFactor' and intended to be used for columnar file formats as a ratio of actual data size (raw data size) to file size to scale up the file size to improve the estimate of in-memory data size and having a better query optimization (i.e., join strategy decision). ### Why are the changes needed? Before this change Spark estimated the table size as the sum of all the file sizes. This estimate can be way too low at columnar file formats where huge data can be compressed into a very small file because of serialization (like dictionary encoding) and compression. With the `deserFactor` OOM error raised as a result of a wrongly chosen broadcast join strategy can be avoided. ### Does this PR introduce any user-facing change? No ### How was this patch tested? The StatisticsSuite is extended with a new test: `SPARK-24914 - test deserialization factor`.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
