Hi all,

Yuntao Jia, our intern this summer, ran a simple performance benchmark of 
Hadoop, Hive and Pig based on the queries in the SIGMOD 2009 paper "A 
Comparison of Approaches to Large-Scale Data Analysis".

The report and the performance test kit are both attached here:
http://issues.apache.org/jira/browse/HIVE-396


We tried our best to get good performance out of Hive and Pig, and we kept the 
Hadoop program as close as possible to the one in the SIGMOD paper.  We welcome 
all suggestions on how to improve performance further, either by changing the 
configuration or by improving the code.


While we tried our best to be fair, system settings and environments do affect 
the results a lot.  So we encourage everybody to try out the performance test 
kit on their own cluster, and we would appreciate it if everybody could share 
their results.


Here is the summary.  The details are in the report 
hive_benchmark_2009-06-18.pdf from the link above.

Query: GREP SELECT
Hadoop: 136.1s
Hive:   125.4s
Pig:    247.8s

Query: RANKINGS SELECT
Hadoop: 26.1s
Hive:   31.0s
Pig:    38.4s

Query: USERVISITS AGGREGATION
Hadoop: 533.8s
Hive:   768.8s
Pig:    855.4s

Query: RANKINGS USERVISITS JOIN
Hadoop: 470.0s
Hive:   471.3s
Pig:    763.9s
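
For anyone who wants to compare the numbers at a glance, here is a small 
Python sketch that divides each time by the Hadoop time for the same query. 
It is not part of the performance test kit; the names timings and 
relative_to_baseline are made up for this illustration, and the numbers are 
simply the ones listed above.

```python
# Timings in seconds, copied from the summary in this email.
timings = {
    "GREP SELECT": {"Hadoop": 136.1, "Hive": 125.4, "Pig": 247.8},
    "RANKINGS SELECT": {"Hadoop": 26.1, "Hive": 31.0, "Pig": 38.4},
    "USERVISITS AGGREGATION": {"Hadoop": 533.8, "Hive": 768.8, "Pig": 855.4},
    "RANKINGS USERVISITS JOIN": {"Hadoop": 470.0, "Hive": 471.3, "Pig": 763.9},
}


def relative_to_baseline(timings, baseline="Hadoop"):
    """Return each system's time divided by the baseline system's time.

    Hypothetical helper for illustration only; not from the test kit.
    """
    ratios = {}
    for query, by_system in timings.items():
        base = by_system[baseline]
        ratios[query] = {
            system: seconds / base for system, seconds in by_system.items()
        }
    return ratios


ratios = relative_to_baseline(timings)
for query, by_system in ratios.items():
    print(query)
    for system, ratio in by_system.items():
        print(f"  {system:<7}{ratio:.2f}x")
```

A ratio below 1.00 means the system was faster than the Hadoop program on 
that query; a ratio above 1.00 means it was slower.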

Please take a look at hive_benchmark_2009-06-18.pdf from the link above for 
details. Let's keep discussions on 
http://issues.apache.org/jira/browse/HIVE-396 so it's easier to keep track.


Zheng
