[
https://issues.apache.org/jira/browse/HIVE-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xin Hao updated HIVE-13634:
---------------------------
Description:
Hive-on-Spark performed worse than Hive-on-MR, for queries with external
scripts.
TPCx-BB Q2, Q3, and Q4 are Python streaming cases that call external scripts
to handle reduce tasks. For these three queries, Hive-on-Spark (HoS) is slower
than Hive-on-MR when processing reduce tasks with external (Python) scripts,
so ‘Improve HoS performance for queries with external scripts’ looks like a
performance optimization opportunity.
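For readers unfamiliar with this query pattern, here is a minimal sketch of the kind of reduce-side script it involves. It is not the actual TPCx-BB script: the function names and the counting logic are hypothetical. It assumes Hive's default streaming behaviour, where rows reach the script on stdin as tab-separated text and the script writes tab-separated rows to stdout, and it assumes rows with the same key arrive consecutively (for example via DISTRIBUTE BY / SORT BY in the query).

```python
import sys
from itertools import groupby


def reduce_rows(lines):
    """Yield one 'key<TAB>count' line per run of consecutive rows sharing a key.

    Hypothetical reducer: the first tab-separated field is treated as the key.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(rows, key=lambda fields: fields[0]):
        yield f"{key}\t{sum(1 for _ in group)}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Stream rows from stdin to stdout, as an external reduce script would."""
    for out in reduce_rows(stdin):
        stdout.write(out + "\n")
```

Saved as a standalone script, this would end with a call to main() under an `if __name__ == "__main__":` guard. Every row crosses a process boundary through a pipe in both directions, so the per-row cost of feeding the script is part of what the timings in this issue measure.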
The following shows the Q2/Q3/Q4 test results on an 8-worker-node cluster with
the TPCx-BB 3TB data size.
TPCx-BB Query 2
(1) Hive-on-MR
    Total Query Execution Time (sec): 2172.180
    Execution Time of External Scripts (sec): 736
(2) Hive-on-Spark
    Total Query Execution Time (sec): 2283.604
    Execution Time of External Scripts (sec): 1197
TPCx-BB Query 3
(1) Hive-on-MR
    Total Query Execution Time (sec): 1070.632
    Execution Time of External Scripts (sec): 513
(2) Hive-on-Spark
    Total Query Execution Time (sec): 1287.679
    Execution Time of External Scripts (sec): 919
TPCx-BB Query 4
(1) Hive-on-MR
    Total Query Execution Time (sec): 1781.864
    Execution Time of External Scripts (sec): 1518
(2) Hive-on-Spark
    Total Query Execution Time (sec): 2028.023
    Execution Time of External Scripts (sec): 1599
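To make the gap easier to compare across queries, the reported timings can be turned into Hive-on-Spark / Hive-on-MR ratios. The sketch below only restates the numbers above; the `RESULTS` layout and the `slowdown` helper are illustrative names, not part of any Hive tooling.

```python
# Timings in seconds, copied from the results above: (total, external scripts).
RESULTS = {
    "Q2": {"MR": (2172.180, 736), "Spark": (2283.604, 1197)},
    "Q3": {"MR": (1070.632, 513), "Spark": (1287.679, 919)},
    "Q4": {"MR": (1781.864, 1518), "Spark": (2028.023, 1599)},
}


def slowdown(query):
    """Return (total_ratio, script_ratio) of Hive-on-Spark over Hive-on-MR."""
    mr_total, mr_script = RESULTS[query]["MR"]
    spark_total, spark_script = RESULTS[query]["Spark"]
    return spark_total / mr_total, spark_script / mr_script


for query in RESULTS:
    total_ratio, script_ratio = slowdown(query)
    print(f"{query}: total x{total_ratio:.2f}, external scripts x{script_ratio:.2f}")
```

With these inputs the external-script time is about 1.63x (Q2), 1.79x (Q3), and 1.05x (Q4) that of Hive-on-MR, while total query time is about 1.05x, 1.20x, and 1.14x respectively.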
was:
Hive-on-Spark performed worse than Hive-on-MR, for queries with external
scripts.
For TPCx-BB Q2/Q3/Q4, they are Python Streaming related cases and will call
external scripts to handle reduce tasks. We found that for these 3 queries
Hive-on-Spark shows lower performance than Hive-on-MR when processing reduce
tasks with external (Python) scripts. So ‘Improve HoS performance for queries
with external scripts’ seems a performance optimization opportunity.
> Hive-on-Spark performed worse than Hive-on-MR, for queries with external
> scripts
> --------------------------------------------------------------------------------
>
> Key: HIVE-13634
> URL: https://issues.apache.org/jira/browse/HIVE-13634
> Project: Hive
> Issue Type: Bug
> Reporter: Xin Hao