[ https://issues.apache.org/jira/browse/PIG-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13170566#comment-13170566 ]
Dmitriy V. Ryaboy commented on PIG-2397:
----------------------------------------

I was curious about the massive difference between what Jie was seeing for Hive and Pig on Q1, and did a little digging of my own. I couldn't reproduce that difference out of the box on my cluster: over several runs of Q1, Hive ranged between 160 and 240 seconds, while Pig ranged between roughly 290 and 350 seconds.

Digging in a little further, I think there are 3 things worth noting:

1) The Hive TPC-H scripts set mapred.min.split.size=536870912 while the Pig ones do not. This means Pig picks up whatever the cluster defaults are, and the difference in the number of mappers is greatly exaggerated on small clusters that cannot run hundreds of tasks in parallel (task set-up costs keep accumulating). I recommend that the Pig scripts in PIG-2397 set this parameter to the same value as the Hive TPC-H scripts, for consistency.

2) We generate a sampling job for an ORDER BY even when the parallelism of that operator is set to 1, in which case sampling and custom partitioning are useless. Skipping the sampling job is a free performance gain, and the case comes up in many real-life jobs, not just benchmarks. We should fix this and get 30 seconds per job back.

3) When the split sizes are comparable for TPC-H Q1, Hive's tasks finish in about 60 seconds on average, while Pig's take about 84 seconds. I believe this is because Hive triggers in-memory aggregation and output based on memory utilization, whereas we have a hardcoded MAX_SIZE_CURVAL_CACHE = 1024. In this particular case, that means Hive's tasks output 4 records (a single aggregation), while ours output 28 (9 aggregations). If we make MAX_SIZE_CURVAL_CACHE configurable, or base it on memory, we can probably improve performance for small records.
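To illustrate point 3, here is a toy Python model of map-side partial aggregation with a fixed flush threshold. It is only a sketch under stated assumptions, not Pig's or Hive's actual code: the function names, the flush rule (emit every key's partial sum once the number of buffered values reaches the threshold), and the synthetic input are all made up for illustration. It shows why a small fixed cache makes each task emit more partial records than a cache large enough to hold the whole input.

```python
from collections import defaultdict


def partial_aggregate(records, max_cached_values):
    """Toy map-side partial aggregation (hypothetical, not Pig's implementation).

    Buffers raw values per key and emits one partial sum per key whenever
    the total number of buffered values reaches max_cached_values.
    Returns the list of (key, partial_sum) records emitted by the "task".
    """
    buffered = defaultdict(list)
    buffered_count = 0
    emitted = []

    def flush():
        nonlocal buffered_count
        for key in sorted(buffered):
            emitted.append((key, sum(buffered[key])))
        buffered.clear()
        buffered_count = 0

    for key, value in records:
        buffered[key].append(value)
        buffered_count += 1
        if buffered_count >= max_cached_values:
            flush()

    # Emit whatever is still buffered at the end of the input.
    if buffered_count:
        flush()
    return emitted


def final_aggregate(partials):
    """Reduce-side step: combine partial sums into one total per key."""
    totals = defaultdict(int)
    for key, value in partials:
        totals[key] += value
    return dict(totals)


# Synthetic input: four group keys (an assumption made for this example),
# 8000 records with value 1 each, spread evenly over the keys.
keys = ["A|F", "N|F", "N|O", "R|F"]
records = [(keys[i % 4], 1) for i in range(8000)]

small = partial_aggregate(records, 1024)       # fixed, small threshold
large = partial_aggregate(records, 1_000_000)  # threshold larger than the input

print(len(small), len(large))
print(final_aggregate(small) == final_aggregate(large))
```

Both settings give the same final totals; only the number of intermediate records written by the task changes, which is the cost a configurable or memory-based threshold would reduce.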
D

> Running TPC-H on Pig
> --------------------
>
>                 Key: PIG-2397
>                 URL: https://issues.apache.org/jira/browse/PIG-2397
>             Project: Pig
>          Issue Type: Task
>            Reporter: Jie Li
>         Attachments: TPC-H_on_Pig.tgz, pig_tpch.ppt
>
>
> For a class project we developed a whole set of Pig scripts for TPC-H. Our
> goals are:
> 1) identifying the bottlenecks of Pig's performance especially of its
> relational operators,
> 2) studying how to write efficient scripts by making full use of Pig Latin's
> features,
> 3) comparing with Hive's TPC-H results for verifying both 1) and 2).
> We will update the JIRA with our scripts, results and analysis soon.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira