Travis Brady
Wed, 26 Mar 2008 14:03:42 -0700
I really like writing Pig code, but I'm seeing pretty terrible performance:
a simple data rollup takes about 90 minutes to complete in Pig. The
equivalent job, expressed using shell scripts and Haskell and executed with
Hadoop streaming, runs in roughly 5 minutes.
My dataset is stored on HDFS as a handful of tab-delimited text files. In
total there are 19 million rows of data.
This is running on a 3-node cluster where each machine has 8 GB of RAM. All
three machines are configured per the instructions on the Hadoop wiki for
setting up Hadoop on Ubuntu.
Here is the Pig code:
<code>
Raw = LOAD 'stats_dump_200707' USING PigStorage('\t');
HourGroups = GROUP Raw BY $0;
RollUp = FOREACH HourGroups {
    GENERATE FLATTEN(group), COUNT(Raw);
}
DUMP RollUp;
</code>
Do I need to add the PARALLEL keyword in there somewhere? Or change
something in hadoop-site.xml?
The Hadoop streaming version uses "cut -c 1-13" as the mapper and a bit of
Haskell compiled with ghc as the reducer.
I can send the Haskell code along if it would help, but for now I assume I
must be doing something wrong for it to perform so poorly.
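To make the comparison concrete, the logic of the streaming job can be
approximated locally on a small sample with plain shell tools. This is only
a rough sketch, not the actual job: the function name rollup is made up for
illustration, the awk counter merely stands in for the Haskell reducer
(which is not shown here), and it assumes the hour key occupies exactly the
first 13 characters of every line.

```shell
#!/bin/sh
# Local stand-in for the streaming rollup, useful for checking counts on a
# small sample. Assumption: the first 13 characters of each line are the key.

rollup() {
    # "Mapper": keep only the first 13 characters of each input line.
    cut -c 1-13 |
    # Hadoop sorts mapper output by key before reducing; sort stands in for that.
    LC_ALL=C sort |
    # "Reducer" stand-in: count runs of identical keys, print key<TAB>count.
    awk '
        NR > 1 && $0 != prev { printf "%s\t%d\n", prev, n; n = 0 }
        { prev = $0; n++ }
        END { if (NR > 0) printf "%s\t%d\n", prev, n }
    '
}

# Usage (sample.tsv is a hypothetical local extract of the data):
#   rollup < sample.tsv
```

The output should match what the Pig script's DUMP produces for the same
sample, provided the first field really is 13 characters wide.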
Thank you,
--
Travis Brady
www.mochiads.com