Lijia Liu created SPARK-20475:
---------------------------------
Summary: Whether use "broadcast join" depends on hive configuration
Key: SPARK-20475
URL: https://issues.apache.org/jira/browse/SPARK-20475
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.1.0
Reporter: Lijia Liu
Currently, broadcast join in Spark only works while:
1. The value of "spark.sql.autoBroadcastJoinThreshold" bigger than 0(default
is 10485760).
2. The size of one of the hive tables less than
"spark.sql.autoBroadcastJoinThreshold". To get the size information of the hive
table from hive metastore, "hive.stats.autogather" should be set to true in
hive or the command "ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan" has
been run.
But in Hive, it calculate the size of the file or directory corresponding to
the hive table to determine whether to use the map side join, and does not
depend on the hive metastore.
This leads to two problems:
1. Spark will not use "broadcast join" when the hive parameter
"hive.stats.autogather" is not set to ture or the command "ANALYZE TABLE
<tableName> COMPUTE STATISTICS noscan" has not been run because the information
of the hive table has not saved in hive metastore . The mode of work in Spark
depends on the configuration of Hive.
2. For some reason, we set "hive.stats.autogather" to false in our Hive. For
the same SQL, Hive is 4 times faster than Spark because Hive used "map side
join" but Spark did not use "broadcast join".
Is it possible to use the mechanism same to hive's to look up the size of a
hive tale in Spark.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]