Lijia Liu created SPARK-20475:
---------------------------------

             Summary: Whether use "broadcast join" depends on hive configuration
                 Key: SPARK-20475
                 URL: https://issues.apache.org/jira/browse/SPARK-20475
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Lijia Liu


Currently,  broadcast join in Spark only works while:
  1. The value of "spark.sql.autoBroadcastJoinThreshold" bigger than 0(default 
is 10485760).
  2. The size of one of the hive tables less than 
"spark.sql.autoBroadcastJoinThreshold". To get the size information of the hive 
table from hive metastore,  "hive.stats.autogather" should be set to true in 
hive or the command "ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan" has 
been run.

But in Hive, it calculate the size of the file or directory corresponding to 
the hive table to determine whether to use the map side join, and does not 
depend on the hive metastore.

This leads to two problems:
  1. Spark will not use "broadcast join" when the hive parameter 
"hive.stats.autogather" is not set to ture or the command "ANALYZE TABLE 
<tableName> COMPUTE STATISTICS noscan" has not been run because the information 
of the hive table has not saved in hive metastore . The mode of work in Spark 
depends on the configuration of Hive.
  2. For some reason, we set "hive.stats.autogather" to false in our Hive. For 
the same SQL, Hive is 4 times faster than Spark because Hive used "map side 
join" but Spark did not use "broadcast join".

Is it possible to use the mechanism same to hive's to look up the size of a 
hive tale in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to