[ 
https://issues.apache.org/jira/browse/SPARK-24914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-24914:
----------------------------------
    Description: 
When determining whether to do a broadcast join, Spark estimates the size of 
the smaller table as follows:
 - if totalSize is defined and greater than 0, use it;
 - else, if rawDataSize is defined and greater than 0, use it;
 - else, use spark.sql.defaultSizeInBytes (default: Long.MaxValue).

Therefore, Spark prefers totalSize over rawDataSize.
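
For reference, a minimal sketch of that fallback in Scala (simplified; the helper name 
and parameters are illustrative, not Spark's actual internals):
{noformat}
// Simplified sketch of how the size estimate is chosen for a Hive table.
def estimatedSizeInBytes(
    totalSize: Option[Long],      // from the table's "totalSize" parameter
    rawDataSize: Option[Long],    // from the table's "rawDataSize" parameter
    defaultSizeInBytes: Long      // spark.sql.defaultSizeInBytes, Long.MaxValue by default
  ): BigInt = {
  if (totalSize.exists(_ > 0)) BigInt(totalSize.get)
  else if (rawDataSize.exists(_ > 0)) BigInt(rawDataSize.get)
  else BigInt(defaultSizeInBytes)
}
{noformat}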

Unfortunately, totalSize is often much smaller than the actual in-memory size of 
the data, since it represents the size of the table's files on disk. Parquet and 
ORC files, for example, are encoded and compressed. This can result in the JVM 
throwing an OutOfMemoryError while Spark is loading the table into a 
HashedRelation, or when Spark actually attempts to broadcast the data.
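
As a quick check, the estimate Spark ends up using can be read off the optimized plan 
and compared with the broadcast threshold (Spark 2.3+ API; the table name is a placeholder):
{noformat}
// In spark-shell: inspect the size estimate used for join planning.
val plan = spark.table("some_hive_table").queryExecution.optimizedPlan
println(plan.stats.sizeInBytes)  // for a Hive table with stats, this reflects the compressed totalSize
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))  // 10485760 (10 MB) by default
{noformat}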

On the other hand, rawDataSize represents the uncompressed size of the dataset, 
according to the Hive documentation, which would make it a better estimate than 
totalSize. However, due to HIVE-20079, this value is currently just 
#columns * #rows. Once that bug is fixed, rawDataSize may be the superior 
statistic, at least for managed tables.

In the meantime, we could apply a configurable "fudge factor" to totalSize, at 
least for file formats that are encoded and compressed. Hive has an analogous 
setting, hive.stats.deserialization.factor, which defaults to 1.0 and is 
described as follows:
{quote}in the absence of uncompressed/raw data size, total file size will be 
used for statistics annotation. But the file may be compressed, encoded and 
serialized which may be lesser in size than the actual uncompressed/raw data 
size. This factor will be multiplied to file size to estimate the raw data size.
{quote}
In addition to applying the fudge factor, we could compare the adjusted totalSize 
to rawDataSize and use the larger of the two, as sketched below.
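
A rough sketch of that combination, assuming a hypothetical configurable fudgeFactor 
(not an existing Spark setting) and falling back to spark.sql.defaultSizeInBytes only 
when neither statistic is usable:
{noformat}
// Sketch: prefer the larger of (totalSize * fudgeFactor) and rawDataSize.
def adjustedEstimate(
    totalSize: Option[Long],
    rawDataSize: Option[Long],
    fudgeFactor: Double,          // hypothetical knob, e.g. 3.0 for Parquet/ORC tables
    defaultSizeInBytes: Long
  ): BigInt = {
  val adjustedTotal = totalSize.filter(_ > 0).map(s => BigInt((s * fudgeFactor).toLong))
  val raw = rawDataSize.filter(_ > 0).map(BigInt(_))
  val candidates = Seq(adjustedTotal, raw).flatten
  if (candidates.nonEmpty) candidates.max else BigInt(defaultSizeInBytes)
}
{noformat}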

Caveat: This mitigates the issue only for Hive tables. It does not help much 
when the user is reading files using {{spark.read.parquet}}, unless we apply 
the same fudge factor there.
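
For illustration, with a direct file-based read there are no Hive statistics at all, and 
the plan's size estimate is driven by the compressed on-disk file sizes, so the same 
underestimation applies (the path is a placeholder):
{noformat}
// The estimate for a file-based read tracks the size of the files on disk.
val df = spark.read.parquet("/path/to/compressed/data")
println(df.queryExecution.optimizedPlan.stats.sizeInBytes)  // roughly the sum of the Parquet file sizes
{noformat}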

> totalSize is not a good estimate for broadcast joins
> ----------------------------------------------------
>
>                 Key: SPARK-24914
>                 URL: https://issues.apache.org/jira/browse/SPARK-24914
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Bruce Robbins
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
