[GitHub] [spark] attilapiros opened a new pull request #26016: [SPARK-24914][SQL] New statistic to improve data size estimate for columnar storage formats

GitBox Thu, 03 Oct 2019 09:44:53 -0700

attilapiros opened a new pull request #26016: [SPARK-24914][SQL] New statistic 
to improve data size estimate for columnar storage formats
URL: https://github.com/apache/spark/pull/26016
 
 
   ### Why are the changes needed?
   
   Before this change, Spark estimated the table size as the sum of all file 
sizes. This estimate can be far too low for columnar file formats, where a 
large amount of data can fit into a very small file because of serialization 
techniques (such as dictionary encoding) and compression. 
   
   This PR introduces a new statistic called `deserFactor`, which is calculated 
for columnar file formats as the ratio of the actual data size (raw data size) 
to the file size. The file size is scaled up by this factor to improve the 
estimate of the in-memory data size, which leads to better query optimization 
(e.g., the join strategy decision). This way, an OOM error caused by a wrongly 
chosen broadcast join strategy can be avoided.
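
As a rough illustration of the scaling described above, here is a minimal Python sketch. It is not the code from this PR: the function names are hypothetical, rounding the ratio up to a whole number is an assumption, and the threshold constant only mirrors the 10 MiB default of `spark.sql.autoBroadcastJoinThreshold`.

```python
import math

# Assumption: mirrors the 10 MiB default of spark.sql.autoBroadcastJoinThreshold.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def deser_factor(raw_data_size: int, file_size: int) -> int:
    """Ratio of raw data size to on-disk file size (hypothetical helper).

    Rounding up to an integer and clamping to at least 1 are assumptions
    made for this sketch, so the estimate never shrinks below the file size.
    """
    if file_size <= 0:
        return 1
    return max(1, math.ceil(raw_data_size / file_size))


def estimated_size(file_size: int, factor: int) -> int:
    """Scale the on-disk size up by the factor to approximate in-memory size."""
    return file_size * factor


def fits_broadcast(size_in_bytes: int) -> bool:
    """True if a relation of this estimated size would be considered for broadcast."""
    return size_in_bytes <= BROADCAST_THRESHOLD_BYTES


MIB = 1024 * 1024
file_size = 2 * MIB   # compressed ORC file on disk
raw_size = 64 * MIB   # raw (decoded, uncompressed) data size

factor = deser_factor(raw_size, file_size)
naive_estimate = file_size
scaled_estimate = estimated_size(file_size, factor)
```

In this example the unscaled estimate (2 MiB) is below the threshold, so a broadcast join would look safe, while the scaled estimate (64 MiB) is above it.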
   
   For a partitioned table, a factor is calculated for each file and the 
maximum of these factors is taken. Spark stores this factor in the metastore 
and reuses it, so the table can grow without the statistic having to be 
recomputed. The stored factor is removed only by a `TRUNCATE` or a `DROP` of 
the table, so even a subsequent `ANALYZE TABLE` with the calculation disabled 
keeps the old value.
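
The lifecycle described above can be sketched in Python as follows. This is an illustrative model, not the PR's implementation: all names are hypothetical, files are represented as `(raw_size, file_size)` pairs, and it assumes that an `ANALYZE TABLE` with the calculation enabled simply replaces the stored value.

```python
import math
from typing import List, Optional, Tuple

# Each file is modeled as (raw_data_size, file_size) in bytes (an assumption).
FileSizes = Tuple[int, int]


def file_factor(raw_size: int, file_size: int) -> int:
    """Per-file ratio, rounded up and clamped to at least 1 (an assumption)."""
    if file_size <= 0:
        return 1
    return max(1, math.ceil(raw_size / file_size))


def table_factor(files: List[FileSizes]) -> int:
    """Maximum of the per-file factors across all partitions."""
    return max((file_factor(raw, size) for raw, size in files), default=1)


def analyze_table(stored: Optional[int], files: List[FileSizes],
                  calc_enabled: bool) -> Optional[int]:
    """Model of ANALYZE TABLE: recompute if enabled, else keep the stored factor."""
    if calc_enabled:
        return table_factor(files)
    return stored


def truncate_table(stored: Optional[int]) -> Optional[int]:
    """Model of TRUNCATE: the stored factor is removed."""
    return None


files = [(100, 10), (900, 30), (50, 50)]          # per-file factors: 10, 30, 1
stored = analyze_table(None, files, calc_enabled=True)
kept = analyze_table(stored, [(10, 10)], calc_enabled=False)
after_truncate = truncate_table(kept)
```

Taking the maximum is the conservative choice: it can overestimate the table size, but it avoids underestimating it and picking a broadcast join that does not fit in memory.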
   
   Although this is intended to be a generic solution for all columnar file 
formats, this PR currently focuses only on the ORC file format.
   
   ### Does this PR introduce any user-facing change?
   
   No
   
   ### How was this patch tested?
   
   The `StatisticsSuite` is extended with a new test, `SPARK-24914 - test 
deserialization factor (ORC)`, which checks:
   - the calculation and application of the factor
   - keeping and using the old factor when the calculation is switched off
   - calculating the factor for multiple partitions
   - removing the factor on `TRUNCATE`


