[ https://issues.apache.org/jira/browse/HIVE-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365114#comment-14365114 ]

Xuefu Zhang commented on HIVE-9697:
-----------------------------------

It seems that we all agree that rawDataSize is more practical for Spark. Could 
anyone summarize whether it is already the default, and if not, how to make it 
the default? If a code change is required, we can propose a patch here. Thanks.

> Hive on Spark is not as aggressive as MR on map join [Spark Branch]
> -------------------------------------------------------------------
>
>                 Key: HIVE-9697
>                 URL: https://issues.apache.org/jira/browse/HIVE-9697
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xin Hao
>
> We have a finding from running some Big-Bench cases: with the same small-table 
> size threshold, the Map Join operator is not generated in the stage plans for 
> Hive on Spark, while it is generated for Hive on MR.
> For example, when we run BigBench Q25, the metadata of one input ORC table is 
> as below:
>     totalSize=1748955 (about 1.5M)
>     rawDataSize=123050375 (about 120M)
> If we use the following parameter settings,
>     set hive.auto.convert.join=true;
>     set hive.mapjoin.smalltable.filesize=25000000;
>     set hive.auto.convert.join.noconditionaltask=true;
>     set hive.auto.convert.join.noconditionaltask.size=100000000; (100M)
> Map Join will be enabled for Hive on MR, but not for Hive on Spark.
> We found that for Hive on MR, the HDFS file size of the table 
> (ContentSummary.getLength(), which should approximate the value of 'totalSize') 
> is compared with the 100M threshold (and is smaller than 100M), while for Hive 
> on Spark 'rawDataSize' is compared with the 100M threshold (and is larger than 
> 100M). That is why Map Join is not enabled for Hive on Spark in this case, and 
> as a result Hive on Spark shows much lower performance than Hive on MR here.
> When we set hive.auto.convert.join.noconditionaltask.size=150000000 (150M), 
> Map Join is enabled for Hive on Spark as well, and Hive on Spark then shows 
> performance similar to Hive on MR.
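
For reference, a minimal sketch (plain Java, not the actual Hive code path; the 
class and method names below are made up for illustration) of why the same 100M 
budget flips the map-join decision depending on which size metric is compared, 
using the numbers from the description above:

    // Hedged sketch: illustrates the threshold comparison only, not Hive internals.
    public class MapJoinThresholdSketch {

        // hive.auto.convert.join.noconditionaltask.size from the settings above
        static final long NO_CONDITIONAL_TASK_SIZE = 100_000_000L; // 100M

        // Statistics reported for the BigBench Q25 ORC table in the description
        static final long TOTAL_SIZE    = 1_748_955L;   // on-disk size, about 1.5M
        static final long RAW_DATA_SIZE = 123_050_375L; // uncompressed size, about 120M

        // Hypothetical helper: the small side qualifies for a map join only if
        // its size estimate fits within the budget.
        static boolean qualifiesForMapJoin(long smallTableSizeEstimate) {
            return smallTableSizeEstimate <= NO_CONDITIONAL_TASK_SIZE;
        }

        public static void main(String[] args) {
            // MR-style check uses something close to totalSize -> map join allowed
            System.out.println("totalSize check:   " + qualifiesForMapJoin(TOTAL_SIZE));
            // Spark-style check uses rawDataSize -> map join rejected at 100M,
            // which is why raising the threshold to 150M re-enables it
            System.out.println("rawDataSize check: " + qualifiesForMapJoin(RAW_DATA_SIZE));
        }
    }

With totalSize the table fits comfortably under the budget, while rawDataSize 
overshoots it, which matches the observed behavior and the 150M workaround.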



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
