GitHub user jinxing64 opened a pull request:
https://github.com/apache/spark/pull/19560
[SPARK-22334][SQL] Check table size from HDFS in case the size in metastore
is wrong.
## What changes were proposed in this pull request?
Currently we use the table property 'totalSize' to decide whether to use a
broadcast join. Table properties come from the metastore, but they can be
wrong: Hive sometimes fails to update table properties even after producing
data successfully (e.g. a NameNode timeout, see
https://github.com/apache/hive/blob/branch-1.2/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java#L180).
If 'totalSize' in the table properties is much smaller than the table's real
size on HDFS, Spark can launch a broadcast join by mistake and suffer OOM.
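To make the failure mode concrete, here is a minimal sketch (the table name
`t` is hypothetical, and the broadcast decision is simplified from Spark's
actual planner logic):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Size below which Spark considers broadcasting a join side
// (-1 disables auto broadcast joins entirely).
val threshold = spark.conf.get("spark.sql.autoBroadcastJoinThreshold").toLong

// 'totalSize' as recorded in the Hive metastore; it can be stale when Hive
// fails to update table properties after the data was written successfully.
val reportedSize =
  spark.sql("SHOW TBLPROPERTIES t('totalSize')").head().getString(0).toLong

// A stale, too-small 'totalSize' passes this check, so Spark plans a
// broadcast join even though the real data on HDFS is far larger.
val wouldBroadcast = threshold >= 0 && reportedSize <= threshold
```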
Could we add a defensive config and check the size from HDFS when 'totalSize'
is below `spark.sql.autoBroadcastJoinThreshold`?
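
Roughly, the defensive check could look like this (a hedged sketch:
`verifiedSizeInBytes` is an illustrative helper, not the PR's actual
implementation):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def verifiedSizeInBytes(
    reportedSize: Long,
    tableLocation: String,
    threshold: Long,
    hadoopConf: Configuration): Long = {
  if (threshold < 0 || reportedSize > threshold) {
    // No broadcast would be planned anyway; skip the HDFS round trip.
    reportedSize
  } else {
    // Trust the filesystem over possibly stale metastore stats.
    val path = new Path(tableLocation)
    val fs = path.getFileSystem(hadoopConf)
    val actualSize = fs.getContentSummary(path).getLength
    math.max(reportedSize, actualSize)
  }
}
```

Since `getContentSummary` walks the table's directory tree, it can be costly
for tables with many files, which is the reason for gating the check behind a
config and only running it when the reported size is small enough to trigger
a broadcast.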
## How was this patch tested?
Tests added
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jinxing64/spark SPARK-22334
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19560.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19560
----
commit 98975adf129c2359156e97e338f0e3e4f623372b
Author: jinxing <[email protected]>
Date: 2017-10-23T13:02:42Z
Check table size from HDFS in case the size in metastore is wrong.
----