[ 
https://issues.apache.org/jira/browse/PIG-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172015#comment-13172015
 ] 

Jie Li commented on PIG-2397:
-----------------------------

bq. The variability in my numbers is pretty much completely due to delays in 
task scheduling (busy cluster). 

I see. We used a dedicated cluster (though it was on Amazon EC2).

bq. I find it hard to imagine that changing the split size by 8x didn't affect 
Hive performance

For Q1 over 100GB of data, the lineitem table consists of 600 HDFS blocks (our 
default block size is 128MB), so 600 map tasks need about 40 waves on our 
cluster (16 map slots; 600 / 16 = 37.5). If each task takes 2 seconds to set 
up, the total task setup time is about 80 seconds, which is negligible 
compared to Hive's 2300 seconds. 
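
As a rough check of the estimate above, here is a minimal Python sketch of the 
arithmetic. The function name and the cost model (each wave pays the per-task 
setup time once, since tasks in a wave start in parallel) are assumptions made 
for illustration; they are not taken from Hadoop or Hive. With exact rounding 
it gives 38 waves and 76 seconds, in line with the rounded figures of 40 waves 
and 80 seconds.

```python
import math


def setup_overhead(num_tasks, map_slots, setup_secs_per_task):
    """Estimate map-task waves and total setup time (hypothetical model).

    Assumes every wave fills all available slots and that the setup cost
    is paid once per wave because tasks in a wave start in parallel.
    """
    waves = math.ceil(num_tasks / map_slots)
    return waves, waves * setup_secs_per_task


# Figures quoted in the comment: 600 blocks, 16 map slots, 2 s setup per task.
waves, overhead_secs = setup_overhead(600, 16, 2.0)

# Hive's reported Q1 runtime from the comment, in seconds.
hive_runtime_secs = 2300.0
overhead_fraction = overhead_secs / hive_runtime_secs

print(f"waves: {waves}")
print(f"setup overhead: {overhead_secs:.0f} s")
print(f"share of Hive runtime: {overhead_fraction:.1%}")
```

Under this model the setup cost is a few percent of the total runtime, so even 
removing it entirely would not change the Hive result much.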
                
> Running TPC-H on Pig
> --------------------
>
>                 Key: PIG-2397
>                 URL: https://issues.apache.org/jira/browse/PIG-2397
>             Project: Pig
>          Issue Type: Task
>            Reporter: Jie Li
>         Attachments: TPC-H_on_Pig.tgz, pig_tpch.ppt
>
>
> For a class project we developed a whole set of Pig scripts for TPC-H. Our 
> goals are:
> 1) identifying the bottlenecks of Pig's performance especially of its 
> relational operators,
> 2) studying how to write efficient scripts by making full use of Pig Latin's 
> features,
> 3) comparing with Hive's TPC-H results for verifying both 1) and 2).
> We will update the JIRA with our scripts, results and analysis soon.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
