pig-user  

RE: Pig performance

Olga Natkovich
Wed, 26 Mar 2008 15:53:13 -0700

Hi Travis,

There are a couple of things you can do to improve the performance of
your script.

(1) At this point we have fairly basic logic for deciding when a
combiner is invoked. The way your query is written now, it would not be.
However, if you modify your foreach statement as follows, it will be:

RollUp = FOREACH HourGroups GENERATE FLATTEN(group), COUNT(Raw);

You can see if the combiner is invoked by running

EXPLAIN RollUp;

(2) You do need to use the PARALLEL keyword on the group operator to
make sure it runs in parallel.
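
As a rough sketch of what that could look like with the relation names
from your script (the reducer count of 6 is only an illustrative
assumption for a 3-node cluster, not a tuned value):

```pig
-- PARALLEL sets the number of reduce tasks used for the GROUP.
-- The value 6 is a placeholder; adjust it for your cluster.
HourGroups = GROUP Raw BY $0 PARALLEL 6;
```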

Finally, we are working on some performance improvements as part of
the pipeline redesign. You can track the progress at
https://issues.apache.org/jira/browse/pig-157.

Olga

> -----Original Message-----
> From: Travis Brady [EMAIL PROTECTED] 
> Sent: Wednesday, March 26, 2008 2:03 PM
> To: pig-user@incubator.apache.org
> Subject: Pig performance
> 
> I really like writing pig code, but I'm experiencing pretty 
> terrible performance using Pig for a simple data rollup 
> taking about 90 minutes to complete.  The equivalent 
> expressed using shell scripts and Haskell and executed with 
> hadoop streaming runs in roughly 5 minutes.
> My dataset is stored on hdfs as a handful of tab delimited 
> text files.  In sum there are 19 million rows of data.
> 
> This is running on a 3-node cluster where each machine has 
> 8GB of ram.  I have all three machines configured per the 
> instructions on the Hadoop wiki on setting up Hadoop on Ubuntu.
> 
> Here is the pig code:
> <code>
> Raw = LOAD 'stats_dump_200707' USING PigStorage('\t');
> 
> HourGroups = GROUP Raw by $0;
> 
> RollUp = FOREACH HourGroups {
>     GENERATE FLATTEN(group), COUNT(Raw); }
> 
> DUMP RollUp;
> </code>
> 
> Do I need to add the PARALLEL keyword in there somewhere?  
> Change something in hadoop-site.xml?
> 
> The Hadoop streaming stuff uses "cut -c 1-13" as the mapper 
> and a bit of Haskell compiled with ghc as the reducer:
> I can send the Haskell code along if it would help, but for 
> now I assume I must be doing something wrong for it to 
> perform so poorly.
> 
> thank you
> 
> --
> Travis Brady
> www.mochiads.com
>