[ 
https://issues.apache.org/jira/browse/SPARK-20475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985838#comment-15985838
 ] 

Zhenhua Wang edited comment on SPARK-20475 at 4/27/17 3:53 AM:
---------------------------------------------------------------

Spark also collects table size no matter "analyze" command is run or not.
Did you check the threshould in Hive for map side join? Is it the same as in 
Spark?
Besides, if you want to use broadcast join in spark, we have a broadcast hint 
in Spark.


was (Author: zenwzh):
Spark also collects table size no matter "analyze" command is run or not.
Did check the threshould in Hive for map side join? Is it the same as in Spark?
Besides, if you want to use broadcast join in spark, we have a broadcast hint 
in Spark.

> Whether use "broadcast join" depends on hive configuration
> ----------------------------------------------------------
>
>                 Key: SPARK-20475
>                 URL: https://issues.apache.org/jira/browse/SPARK-20475
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Lijia Liu
>
> Currently,  broadcast join in Spark only works while:
>   1. The value of "spark.sql.autoBroadcastJoinThreshold" bigger than 
> 0(default is 10485760).
>   2. The size of one of the hive tables less than 
> "spark.sql.autoBroadcastJoinThreshold". To get the size information of the 
> hive table from hive metastore,  "hive.stats.autogather" should be set to 
> true in hive or the command "ANALYZE TABLE <tableName> COMPUTE STATISTICS 
> noscan" has been run.
> But in Hive, it calculate the size of the file or directory corresponding to 
> the hive table to determine whether to use the map side join, and does not 
> depend on the hive metastore.
> This leads to two problems:
>   1. Spark will not use "broadcast join" when the hive parameter 
> "hive.stats.autogather" is not set to ture or the command "ANALYZE TABLE 
> <tableName> COMPUTE STATISTICS noscan" has not been run because the 
> information of the hive table has not saved in hive metastore . The mode of 
> work in Spark depends on the configuration of Hive.
>   2. For some reason, we set "hive.stats.autogather" to false in our Hive. 
> For the same SQL, Hive is 4 times faster than Spark because Hive used "map 
> side join" but Spark did not use "broadcast join".
> Is it possible to use the mechanism same to hive's to look up the size of a 
> hive tale in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to