pig-user  

Pig performance

Travis Brady
Wed, 26 Mar 2008 14:03:42 -0700

I really like writing Pig code, but I'm seeing pretty terrible performance:
a simple data rollup takes about 90 minutes to complete.  The equivalent
job, written with shell scripts and Haskell and run with Hadoop Streaming,
finishes in roughly 5 minutes.
My dataset is stored on HDFS as a handful of tab-delimited text files.  In
total there are 19 million rows of data.

This is running on a 3-node cluster where each machine has 8 GB of RAM.  I
have all three machines configured per the Hadoop wiki's instructions for
setting up Hadoop on Ubuntu.

Here is the Pig code:
<code>
Raw = LOAD 'stats_dump_200707' USING PigStorage('\t');

HourGroups = GROUP Raw BY $0;

RollUp = FOREACH HourGroups {
    GENERATE FLATTEN(group), COUNT(Raw);
}

DUMP RollUp;
</code>

Do I need to add the PARALLEL keyword in there somewhere?  Or change
something in hadoop-site.xml?

The Hadoop Streaming job uses "cut -c 1-13" as the mapper and a bit of
Haskell compiled with GHC as the reducer.
I can send the Haskell code along if it would help, but for now I assume I
must be doing something wrong for Pig to perform so poorly.
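
For illustration, the logic of the streaming job can be approximated locally
with a short shell pipeline.  This is only a sketch, not the actual job: the
sample rows are made up, the 13-character hour key layout is an assumption,
and awk stands in for the Haskell reducer.

```shell
# Hypothetical sample rows: a 13-character hour key, a tab, then other columns.
# The real layout of stats_dump_200707 is not shown in this post.
sample=$(printf '2007-07-01 00\tfoo\n2007-07-01 00\tbar\n2007-07-01 01\tbaz\n')

# Mapper step: keep only the first 13 characters of each line (the hour key).
# Reducer stand-in: count the occurrences of each key, then sort by key.
rollup=$(printf '%s\n' "$sample" \
  | cut -c 1-13 \
  | awk '{ n[$0]++ } END { for (k in n) print k "\t" n[k] }' \
  | sort)

printf '%s\n' "$rollup"
```

Each output line is an hour key followed by a tab and the number of rows
seen for that hour, which is the same shape as the RollUp relation above.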

Thank you.

-- 
Travis Brady
www.mochiads.com