[
https://issues.apache.org/jira/browse/SPARK-24914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-24914:
----------------------------------
Affects Version/s: 3.1.0 (was: 3.0.0)
> totalSize is not a good estimate for broadcast joins
> ----------------------------------------------------
>
> Key: SPARK-24914
> URL: https://issues.apache.org/jira/browse/SPARK-24914
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Bruce Robbins
> Priority: Major
>
> When determining whether to do a broadcast join, Spark estimates the size of
> the smaller table as follows:
> - if totalSize is defined and greater than 0, use it
> - else, if rawDataSize is defined and greater than 0, use it
> - else, use spark.sql.defaultSizeInBytes (default: Long.MaxValue)
> Therefore, Spark prefers totalSize over rawDataSize.
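> For illustration, here is a minimal Scala sketch of that fallback order (the
> method and parameter names are hypothetical, not the actual Spark internals;
> only spark.sql.defaultSizeInBytes is a real setting):
> {code:scala}
> // Hypothetical sketch of the current fallback order described above.
> // totalSize and rawDataSize come from the Hive table statistics;
> // defaultSizeInBytes corresponds to spark.sql.defaultSizeInBytes.
> def estimateTableSize(
>     totalSize: Option[Long],
>     rawDataSize: Option[Long],
>     defaultSizeInBytes: Long): Long = {
>   if (totalSize.exists(_ > 0)) totalSize.get
>   else if (rawDataSize.exists(_ > 0)) rawDataSize.get
>   else defaultSizeInBytes // defaults to Long.MaxValue
> }
> {code}
> The resulting estimate is what gets compared against
> spark.sql.autoBroadcastJoinThreshold (default 10 MB) when deciding whether to
> broadcast.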
> Unfortunately, totalSize is often much smaller than the table's actual
> in-memory size, since it represents the size of the table's files on disk,
> and Parquet and ORC files, for example, are encoded and compressed. This can
> result in the JVM throwing an OutOfMemoryError while Spark is loading the
> table into a HashedRelation, or when Spark actually attempts to broadcast
> the data.
> On the other hand, rawDataSize represents the uncompressed size of the
> dataset, according to Hive documentation. This seems like a pretty good
> number to use in preference to totalSize. However, due to HIVE-20079, this
> value is simply #columns * #rows. Once that bug is fixed, it may be a
> superior statistic, at least for managed tables.
> In the meantime, we could apply a configurable "fudge factor" to totalSize,
> at least for types of files that are encoded and compressed. Hive has the
> setting hive.stats.deserialization.factor, which defaults to 1.0, and is
> described as follows:
> {quote}in the absence of uncompressed/raw data size, total file size will be
> used for statistics annotation. But the file may be compressed, encoded and
> serialized which may be lesser in size than the actual uncompressed/raw data
> size. This factor will be multiplied to file size to estimate the raw data
> size.
> {quote}
> Also, I propose a configuration setting to allow the user to completely
> ignore rawDataSize, since that value is broken (due to HIVE-20079). When that
> configuration setting is set to true, Spark would instead estimate the table
> as follows:
> - if totalSize is defined and greater than 0, use totalSize * fudgeFactor
> - else, use spark.sql.defaultSizeInBytes (default: Long.MaxValue)
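> A rough Scala sketch of the proposed estimate follows; the helper and the
> fudge-factor parameter are illustrative only, not existing Spark settings:
> {code:scala}
> // Hypothetical sketch of the proposed behaviour: ignore rawDataSize and
> // apply a configurable fudge factor to totalSize, analogous to Hive's
> // hive.stats.deserialization.factor.
> def estimateTableSizeProposed(
>     totalSize: Option[Long],
>     fudgeFactor: Double,
>     defaultSizeInBytes: Long): Long = {
>   totalSize.filter(_ > 0)
>     .map(size => (size * fudgeFactor).toLong)
>     .getOrElse(defaultSizeInBytes) // defaults to Long.MaxValue
> }
> {code}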
> Caveat: This mitigates the issue only for Hive tables. It does not help much
> when the user is reading files using {{spark.read.parquet}}, unless we apply
> the same fudge factor there.