Dongjoon Hyun updated SPARK-24914:
----------------------------------
    Affects Version/s:     (was: 2.4.0)
                           3.0.0

> totalSize is not a good estimate for broadcast joins
> -----------------------------------------------------
>
>                 Key: SPARK-24914
>                 URL: https://issues.apache.org/jira/browse/SPARK-24914
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> When determining whether to do a broadcast join, Spark estimates the size
> of the smaller table as follows (sketched in code below):
> - if totalSize is defined and greater than 0, use it
> - else, if rawDataSize is defined and greater than 0, use it
> - else, use spark.sql.defaultSizeInBytes (default: Long.MaxValue)
> Therefore, Spark prefers totalSize over rawDataSize.
> Unfortunately, totalSize is often quite a bit smaller than the actual table
> size, since it represents the size of the table's files on disk. Parquet
> and ORC files, for example, are encoded and compressed. This can result in
> the JVM throwing an OutOfMemoryError while Spark is loading the table into
> a HashedRelation, or when Spark actually attempts to broadcast the data.
> On the other hand, rawDataSize represents the uncompressed size of the
> dataset, according to the Hive documentation, which would make it the
> better statistic to prefer over totalSize. However, due to HIVE-20079,
> this value is simply #columns * #rows. Once that bug is fixed, it may be
> the superior statistic, at least for managed tables.
> In the meantime, we could apply a configurable "fudge factor" to totalSize,
> at least for file formats that are encoded and compressed. Hive has the
> setting hive.stats.deserialization.factor, which defaults to 1.0 and is
> described as follows:
> {quote}in the absence of uncompressed/raw data size, total file size will
> be used for statistics annotation. But the file may be compressed, encoded
> and serialized which may be lesser in size than the actual
> uncompressed/raw data size. This factor will be multiplied to file size to
> estimate the raw data size.
> {quote}
> Also, I propose a configuration setting to allow the user to completely
> ignore rawDataSize, since that value is broken (due to HIVE-20079). When
> that setting is true, Spark would instead estimate the table size as
> follows (see the second sketch below):
> - if totalSize is defined and greater than 0, use totalSize * fudgeFactor
> - else, use spark.sql.defaultSizeInBytes (default: Long.MaxValue)
> Caveat: this mitigates the issue only for Hive tables. It does not help
> much when the user reads files with {{spark.read.parquet}}, unless we
> apply the same fudge factor there.
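>
> For illustration, here is a minimal Scala sketch of the current preference
> order. The helper name and its Option-based inputs are simplified
> assumptions for this writeup, not Spark's actual internals (which read
> these values from the Hive table's statistics parameters):
> {code:scala}
> // Sketch of the current estimation order (hypothetical helper, not
> // Spark's real code): prefer totalSize, then rawDataSize, then the default.
> def currentSizeEstimate(
>     totalSize: Option[Long],    // Hive stat: size of the table's files on disk
>     rawDataSize: Option[Long],  // Hive stat: uncompressed data size
>     defaultSizeInBytes: Long    // spark.sql.defaultSizeInBytes (Long.MaxValue)
>   ): Long = {
>   totalSize.filter(_ > 0)                // totalSize, if defined and > 0
>     .orElse(rawDataSize.filter(_ > 0))   // else rawDataSize, if defined and > 0
>     .getOrElse(defaultSizeInBytes)       // else the configured default
> }
> {code}
> Because totalSize wins whenever file-level statistics exist, a heavily
> compressed Parquet or ORC table is almost always sized from its on-disk
> footprint, which is what lets it slip under the broadcast threshold.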
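>
> And a sketch of the proposed behavior, combining the fudge factor
> (analogous to hive.stats.deserialization.factor) with the flag for
> ignoring rawDataSize. Both parameter names are hypothetical; this ticket
> does not name the Spark configuration keys:
> {code:scala}
> // Sketch of the proposed estimation (parameter names hypothetical):
> // inflate totalSize by a fudge factor and optionally ignore rawDataSize.
> def proposedSizeEstimate(
>     totalSize: Option[Long],
>     rawDataSize: Option[Long],
>     fudgeFactor: Double,        // like hive.stats.deserialization.factor (1.0)
>     ignoreRawDataSize: Boolean, // skip rawDataSize while HIVE-20079 is open
>     defaultSizeInBytes: Long
>   ): Long = {
>   val inflated = totalSize.filter(_ > 0).map(s => (s * fudgeFactor).toLong)
>   val fallback = if (ignoreRawDataSize) None else rawDataSize.filter(_ > 0)
>   inflated.orElse(fallback).getOrElse(defaultSizeInBytes)
> }
> {code}
> For example, a table whose files total 1 GB on disk would, with a fudge
> factor of 3.0, be estimated at roughly 3 GB, making it much less likely
> to fall under spark.sql.autoBroadcastJoinThreshold and be chosen for
> broadcast.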