attilapiros commented on issue #26016: [SPARK-24914][SQL] New statistic to improve data size estimate for columnar storage formats
URL: https://github.com/apache/spark/pull/26016#issuecomment-565411662

@maropu the `deserFactor` has some advantages:
- It does not need to be kept up to date at all times. A factor is more general than an absolute number: when new partitions are added with a similar distribution (or even with a different one but with worse compression), the factor can be safely reused. This way ANALYZE TABLE needs to be executed less frequently.
- By setting it manually, the user can switch off broadcast joins table by table (the opposite of the broadcast hint, and configurable once per table rather than at each occurrence of the table name in every query that uses it).
- For ORC we have direct support for this factor, so calculating it is expected to be very efficient.
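
To make the first two points concrete, here is a minimal sketch in Python (not Spark code). It assumes the factor is simply multiplied with the on-disk size to approximate the deserialized size, and that a table is a broadcast candidate when that estimate is at most the threshold. The names `estimated_size_in_bytes` and `can_broadcast` are hypothetical and only for illustration; the 10 MB value is the documented default of `spark.sql.autoBroadcastJoinThreshold`.

```python
# Hypothetical sketch: these names do not come from the Spark code base.
# Assumption: estimated in-memory size = on-disk size * deserFactor.

# Documented default of spark.sql.autoBroadcastJoinThreshold (10 MB).
AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024


def estimated_size_in_bytes(on_disk_bytes, deser_factor=1):
    """Scale the on-disk size by the factor to approximate the deserialized size."""
    return int(on_disk_bytes * deser_factor)


def can_broadcast(on_disk_bytes, deser_factor=1,
                  threshold=AUTO_BROADCAST_JOIN_THRESHOLD):
    """A table is a broadcast candidate if its estimated size fits the threshold."""
    return estimated_size_in_bytes(on_disk_bytes, deser_factor) <= threshold


MB = 1024 * 1024

# Without a factor, a 4 MB columnar table looks small enough to broadcast.
print(can_broadcast(4 * MB))

# With a factor of 5 the estimate is 20 MB, so the broadcast is avoided.
print(can_broadcast(4 * MB, deser_factor=5))

# After new partitions arrive (6 MB on disk), the old factor is reused
# without re-running ANALYZE TABLE; only the on-disk size changed.
print(estimated_size_in_bytes(6 * MB, deser_factor=5))
```

Under these assumptions, a user who wants to rule out broadcasting one particular table only has to set a large enough factor on that table, instead of touching every query that references it.
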
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
