[ 
https://issues.apache.org/jira/browse/HIVE-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Hao updated HIVE-9697:
--------------------------
    Description: 
While running some BigBench cases, we found the following:
when the same small-table size threshold is used, the Map Join operator is not 
generated in the stage plans for Hive on Spark, but it is generated for Hive on 
MR.

For example, when we run BigBench Q25, the metadata of one input ORC table is 
as follows:
    totalSize=1748955 (about 1.7M)
    rawDataSize=123050375 (about 120M)
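
As a quick check of the approximate sizes quoted above, the byte counts can be 
converted to megabytes. This is only an illustrative sketch: the helper name 
`to_mib` is made up for this example, and it assumes that "M" means 2^20 bytes.

```python
# Hypothetical helper (not part of Hive): convert a byte count to MiB.
def to_mib(num_bytes: int) -> float:
    return num_bytes / (1024 * 1024)

total_size = 1748955       # totalSize from the table metadata above
raw_data_size = 123050375  # rawDataSize from the table metadata above

print(f"totalSize   = {to_mib(total_size):.2f} MiB")
print(f"rawDataSize = {to_mib(raw_data_size):.2f} MiB")
print(f"ratio       = {raw_data_size / total_size:.1f}x")
```

Under that assumption, 'rawDataSize' is roughly 70 times larger than 
'totalSize' for this table.
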
If we use the following parameter settings,
    set hive.auto.convert.join=true;
    set hive.mapjoin.smalltable.filesize=25000000;
    set hive.auto.convert.join.noconditionaltask=true;
    set hive.auto.convert.join.noconditionaltask.size=100000000; (100M)
Map Join is enabled for Hive on MR but is not enabled for Hive on Spark.

We found that Hive on MR compares the HDFS file size of the table 
(ContentSummary.getLength(), which should approximate the value of 'totalSize') 
with the 100M threshold, and that size is smaller than 100M. Hive on Spark 
instead compares 'rawDataSize' with the same threshold, and 'rawDataSize' is 
larger than 100M. That is why Map Join is not enabled for Hive on Spark in this 
case, and as a result Hive on Spark performs much worse than Hive on MR.

When we set hive.auto.convert.join.noconditionaltask.size=150000000; (150M), 
Map Join is enabled for Hive on Spark as well, and Hive on Spark then performs 
similarly to Hive on MR.
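
The difference described above can be sketched as a small comparison. This is 
not Hive's actual code: the function and constant names are hypothetical, and 
the sketch assumes only what this report states, namely that Hive on MR 
compares a size close to 'totalSize' with the threshold while Hive on Spark 
compares 'rawDataSize'. Whether the real comparison is strict is also an 
assumption; it does not affect the values used here.

```python
# Illustrative sketch only; names are hypothetical, not Hive source code.
TOTAL_SIZE = 1748955       # size Hive on MR effectively compares (per this report)
RAW_DATA_SIZE = 123050375  # size Hive on Spark compares (per this report)

def map_join_enabled(engine: str, threshold: int) -> bool:
    """Return True if the table size used by the engine is under the threshold."""
    if engine == "mr":
        size = TOTAL_SIZE
    elif engine == "spark":
        size = RAW_DATA_SIZE
    else:
        raise ValueError(f"unknown engine: {engine}")
    return size < threshold  # strict comparison is an assumption

# Try both values of hive.auto.convert.join.noconditionaltask.size from the report.
for threshold in (100_000_000, 150_000_000):
    for engine in ("mr", "spark"):
        print(threshold, engine, map_join_enabled(engine, threshold))
```

With the 100M threshold only the MR case passes; with 150M both pass, which 
matches the behavior reported here.
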


  was:
While running some BigBench cases, we found the following:
when the same small-table size threshold is used, the Map Join operator is not 
generated in the stage plans for Hive on Spark, but it is generated for Hive on 
MR.

For example, when we run BigBench Q25, the metadata of one input ORC table is 
as follows:
    totalSize=1748955 (about 1.7M)
    rawDataSize=123050375 (about 120M)
If we use the following parameter settings,
    set hive.auto.convert.join=true;
    set hive.mapjoin.smalltable.filesize=25000000;
    set hive.auto.convert.join.noconditionaltask=true;
    set hive.auto.convert.join.noconditionaltask.size=100000000; (100M)
Map Join is enabled for Hive on MR but is not enabled for Hive on Spark.

We found that Hive on MR compares 'totalSize' with the 100M threshold 
('totalSize' is about 1.7M, smaller than 100M), while Hive on Spark compares 
'rawDataSize' with the threshold ('rawDataSize' is about 120M, larger than 
100M). That is why Map Join is not enabled for Hive on Spark in this case, and 
as a result Hive on Spark performs much worse than Hive on MR.

When we set hive.auto.convert.join.noconditionaltask.size=150000000; (150M), 
Map Join is enabled for Hive on Spark as well, and Hive on Spark then performs 
similarly to Hive on MR.



> Hive on Spark is not as aggressive as MR on map join [Spark Branch]
> -------------------------------------------------------------------
>
>                 Key: HIVE-9697
>                 URL: https://issues.apache.org/jira/browse/HIVE-9697
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xin Hao
>
> While running some BigBench cases, we found the following:
> when the same small-table size threshold is used, the Map Join operator is 
> not generated in the stage plans for Hive on Spark, but it is generated for 
> Hive on MR.
> For example, when we run BigBench Q25, the metadata of one input ORC table 
> is as follows:
>     totalSize=1748955 (about 1.7M)
>     rawDataSize=123050375 (about 120M)
> If we use the following parameter settings,
>     set hive.auto.convert.join=true;
>     set hive.mapjoin.smalltable.filesize=25000000;
>     set hive.auto.convert.join.noconditionaltask=true;
>     set hive.auto.convert.join.noconditionaltask.size=100000000; (100M)
> Map Join is enabled for Hive on MR but is not enabled for Hive on Spark.
> We found that Hive on MR compares the HDFS file size of the table 
> (ContentSummary.getLength(), which should approximate the value of 
> 'totalSize') with the 100M threshold, and that size is smaller than 100M. 
> Hive on Spark instead compares 'rawDataSize' with the same threshold, and 
> 'rawDataSize' is larger than 100M. That is why Map Join is not enabled for 
> Hive on Spark in this case, and as a result Hive on Spark performs much worse 
> than Hive on MR.
> When we set hive.auto.convert.join.noconditionaltask.size=150000000; (150M), 
> Map Join is enabled for Hive on Spark as well, and Hive on Spark then 
> performs similarly to Hive on MR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
