attilapiros opened a new pull request #26016: [SPARK-24914][SQL] New statistic to improve data size estimate for columnar storage formats
URL: https://github.com/apache/spark/pull/26016

### Why are the changes needed?

Before this change, Spark estimated the table size as the sum of all file sizes. This estimate can be far too low for columnar file formats, where a large amount of data can end up in a very small file because of serialization (like dictionary encoding) and compression.

This PR introduces a new statistic called `deserFactor`, which is calculated for columnar file formats as the ratio of the actual data size (raw data size) to the file size. The factor is used to scale up the file size, which improves the estimate of the in-memory data size and leads to better query optimization (e.g., the join strategy decision). This way the OOM error caused by a wrongly chosen broadcast join strategy can be avoided.

For a partitioned table, the factor is calculated for each file and the maximum of these factors is taken.

Spark stores this factor in the metastore and reuses it, so the table can grow without having to recompute this statistic. The stored factor can be removed only by a `TRUNCATE` or a `DROP` of the table, so even a subsequent `ANALYZE TABLE` with the calculation disabled keeps the old value.

Although this is intended to be a generic solution for all columnar file formats, this PR currently focuses only on the ORC file format.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

`StatisticsSuite` is extended with a new test, `SPARK-24914 - test deserialization factor (ORC)`, which checks:

- the factor calculation and application
- keeping and using the old factor when the calculation is switched off
- calculating it for multiple partitions
- removing the factor at `TRUNCATE`
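
To make the arithmetic in the description concrete, here is a minimal Python sketch of how such a factor could be computed and applied. It is not the code from this PR: `FileStats`, `deser_factor`, `resolve_factor`, and `estimated_size` are hypothetical names, the factor is kept as an unrounded float, and the rule that a newly computed factor replaces the stored one is an assumption of the sketch.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics; not a Spark class."""
    file_size: int      # on-disk size in bytes
    raw_data_size: int  # decoded, uncompressed size in bytes


def deser_factor(files: Iterable[FileStats]) -> Optional[float]:
    """Return the maximum raw-size / file-size ratio over all files.

    Empty files are skipped; None means no factor could be computed.
    """
    ratios = [f.raw_data_size / f.file_size for f in files if f.file_size > 0]
    return max(ratios) if ratios else None


def resolve_factor(stored: Optional[float],
                   computed: Optional[float]) -> Optional[float]:
    """Keep the stored factor when nothing new was computed.

    Assumption of this sketch: a freshly computed factor replaces the old one.
    """
    return computed if computed is not None else stored


def estimated_size(total_file_size: int, factor: Optional[float]) -> int:
    """Scale the summed file size by the factor; fall back to the plain sum."""
    if factor is None:
        return total_file_size
    return math.ceil(total_file_size * factor)


if __name__ == "__main__":
    # Two partitions' files: ratios are 10.0 and 3.0, so the maximum is used.
    files = [FileStats(file_size=100, raw_data_size=1000),
             FileStats(file_size=200, raw_data_size=600)]
    factor = deser_factor(files)
    total = sum(f.file_size for f in files)
    print(factor, estimated_size(total, factor))
```

Taking the maximum ratio makes the estimate conservative: a table whose files sum to 2 MiB with a factor of 10 is estimated at 20 MiB, which exceeds the 10 MiB default of `spark.sql.autoBroadcastJoinThreshold`, so it would no longer be picked for a broadcast join.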
