[jira] [Commented] (PIG-2397) Running TPC-H on Pig

Dmitriy V. Ryaboy (Commented) (JIRA) Thu, 15 Dec 2011 15:16:02 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13170566#comment-13170566
 ]


Dmitriy V. Ryaboy commented on PIG-2397:
----------------------------------------

I was curious about the massive difference between what Jie was seeing for Hive 
and Pig on Q1, and did a little digging of my own.
I couldn't reproduce a difference of that size out of the box on my cluster: 
over several runs of Q1, Hive ranged between 160 and 240 seconds, while Pig 
ranged between 290 and 350 seconds (ish).

Digging in a little further, I think there are 3 things worth noting:
1) The Hive TPC-H scripts set mapred.min.split.size=536870912 (512 MB), while 
the Pig ones do not. This means Pig will pick up whatever the cluster defaults 
are, and the difference in the number of mappers will be greatly exaggerated 
when running on small clusters incapable of running hundreds of tasks in 
parallel (task set-up costs will keep accumulating). I recommend that the Pig 
scripts in PIG-2397 set this parameter to the same value as the Hive TPC-H 
scripts, for consistency.

2) We generate a sampling job for an ORDER-BY even when the parallelism of that 
operator is set to 1 (so sampling and custom partitioning are useless). Skipping 
it is a free performance gain, and the case comes up in many real-life 
workloads, not just benchmarks. We should fix this and get 30 seconds per job 
back.

3) When the split sizes are comparable for TPC-H Q1, Hive's tasks finish in 
about 60 seconds on average, while Pig's take about 84 seconds. I believe this 
is because Hive triggers in-memory aggregation and output based on memory 
utilization, whereas we have a hardcoded MAX_SIZE_CURVAL_CACHE = 1024. In this 
particular case, that means Hive's tasks output 4 records (a single 
aggregation), while we output 28 (9 aggregations). If we make 
MAX_SIZE_CURVAL_CACHE configurable, or base it on memory, we can probably 
improve performance for small records.
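
As a rough illustration of point 1, here is a back-of-the-envelope sketch in 
Python. It is a crude model, not a description of the actual scheduler: tasks 
are assumed to run in full waves, and the input size, slot count, set-up cost 
and scan rate are all made-up numbers rather than measurements from my cluster. 
Only the 536870912 value comes from the Hive scripts.

```python
import math

def map_phase_estimate(input_bytes, split_bytes, map_slots,
                       setup_secs, scan_secs_per_gb):
    """Crude wave model of a map phase (all parameters are hypothetical).

    Every task pays a fixed set-up cost plus a scan cost proportional to
    its split size; tasks run in waves of at most `map_slots` at a time.
    """
    num_tasks = math.ceil(input_bytes / split_bytes)
    waves = math.ceil(num_tasks / map_slots)
    gb_per_task = split_bytes / 2**30
    total_secs = waves * (setup_secs + gb_per_task * scan_secs_per_gb)
    return num_tasks, waves, total_secs

if __name__ == "__main__":
    input_bytes = 100 * 2**30           # assumed 100 GB of input
    for split_bytes in (64 * 2**20,     # an assumed small cluster default
                        536870912):     # the value the Hive scripts set
        tasks, waves, secs = map_phase_estimate(
            input_bytes, split_bytes,
            map_slots=20, setup_secs=5.0, scan_secs_per_gb=20.0)
        print(split_bytes, tasks, waves, secs)
```

In this model the scan work is the same either way; the smaller split only 
multiplies the number of tasks, and with few slots the per-task set-up cost is 
paid once per wave, which is where the extra time comes from.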

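To make point 3 concrete, here is a minimal Python sketch of map-side partial 
aggregation with a fixed flush limit. The flush policy (emit everything once a 
set number of input records has been absorbed) is a stand-in I chose for 
illustration; it is not a description of what the Pig code behind 
MAX_SIZE_CURVAL_CACHE does, and the function name, keys and record counts are 
hypothetical.

```python
def partial_aggregate(records, max_cached):
    """Sum values per key in memory, flushing on a fixed record count.

    Hypothetical policy: after `max_cached` input records have been
    absorbed, emit every partial sum and start over. Returns the list of
    (key, partial_sum) records the task would hand to the shuffle.
    """
    cache = {}
    absorbed = 0
    emitted = []
    for key, value in records:
        cache[key] = cache.get(key, 0) + value
        absorbed += 1
        if absorbed >= max_cached:
            emitted.extend(cache.items())
            cache.clear()
            absorbed = 0
    emitted.extend(cache.items())  # final flush at end of input
    return emitted

if __name__ == "__main__":
    groups = ["A|F", "N|F", "N|O", "R|F"]   # four made-up group keys
    records = [(groups[i % 4], 1) for i in range(5000)]
    print(len(partial_aggregate(records, 1024)))     # fixed, small limit
    print(len(partial_aggregate(records, 10**6)))    # limit larger than input
```

With a small fixed limit the same few groups are emitted several times per 
task; with a limit larger than the input (or one driven by memory use, which 
would behave similarly for tiny records) each group is emitted once. The sums 
per group are the same either way, only the number of output records differs.
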
D
                
> Running TPC-H on Pig
> --------------------
>
>                 Key: PIG-2397
>                 URL: https://issues.apache.org/jira/browse/PIG-2397
>             Project: Pig
>          Issue Type: Task
>            Reporter: Jie Li
>         Attachments: TPC-H_on_Pig.tgz, pig_tpch.ppt
>
>
> For a class project we developed a whole set of Pig scripts for TPC-H. Our 
> goals are:
> 1) identifying the bottlenecks of Pig's performance especially of its 
> relational operators,
> 2) studying how to write efficient scripts by making full use of Pig Latin's 
> features,
> 3) comparing with Hive's TPC-H results for verifying both 1) and 2).
> We will update the JIRA with our scripts, results and analysis soon.

