pig-user  

Re: Pig performance

Travis Brady
Wed, 26 Mar 2008 17:28:08 -0700

Hi Olga,

Thanks for your help and thank you for open sourcing Pig.
I converted the FOREACH statement, but it now yields a very lengthy error,
the relevant portion of which I've pasted below.

The good news is the modified FOREACH was going to invoke the combiner.

2008-03-26 16:16:00,322 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher- Error message from task (map) tip_200803261041_0008_m_000000
2008-03-26 16:16:00,322 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher- Error message from task (map) tip_200803261041_0008_m_000002
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2786)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:71)
    at java.io.DataOutputStream.write(DataOutputStream.java:71)
    at org.apache.pig.data.Tuple.encodeInt(Tuple.java:362)
    at org.apache.pig.data.DataAtom.write(DataAtom.java:137)
    at org.apache.pig.data.Tuple.write(Tuple.java:301)
    at org.apache.pig.data.IndexedTuple.write(IndexedTuple.java:52)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:392)
    at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce$MapDataOutputCollector.add(PigMapReduce.java:304)
    at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
    at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.add(GenerateSpec.java:230)
    at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
    at org.apache.pig.impl.eval.collector.DataCollector.addToSuccessor(DataCollector.java:93)
    at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:35)
    at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.exec(GenerateSpec.java:261)
    at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:86)
    at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
    at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:113)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084)
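
For context on why the combiner matters for a COUNT rollup like this one,
here is a minimal Python sketch of the general idea. It is not Pig's or
Hadoop's implementation: the function names, the two input splits, and the
sample rows are all made up for illustration. It only shows that
pre-aggregating counts inside each map task gives the same totals while
sending fewer pairs to the reduce side.

```python
from collections import Counter
from itertools import chain


def map_without_combiner(rows):
    """Emit one (key, 1) pair per input row; every pair is shuffled."""
    return [(row.split("\t")[0], 1) for row in rows]


def map_with_combiner(rows):
    """Pre-aggregate within one map task: one (key, partial_count) per key."""
    partial = Counter(row.split("\t")[0] for row in rows)
    return list(partial.items())


def reduce_counts(pairs):
    """Sum the counts per key, as the reduce side would."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Hypothetical input: two splits of tab-delimited rows keyed by hour.
splits = [
    ["2007-07-01 00\ta", "2007-07-01 00\tb", "2007-07-01 01\tc"],
    ["2007-07-01 00\td", "2007-07-01 01\te", "2007-07-01 01\tf"],
]

plain = list(chain.from_iterable(map_without_combiner(s) for s in splits))
combined = list(chain.from_iterable(map_with_combiner(s) for s in splits))

# Same totals either way; only the number of shuffled pairs differs.
assert reduce_counts(plain) == reduce_counts(combined)
print(len(plain), len(combined))
```

With many rows per key, as in an hourly rollup, the combined output grows
with the number of distinct keys per map task rather than with the number
of rows.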


Travis


On Wed, Mar 26, 2008 at 3:51 PM, Olga Natkovich <[EMAIL PROTECTED]> wrote:

> Hi Travis,
>
> There are a couple of things you can do to improve performance of your
> script.
>
> (1) At this point we have pretty basic logic for deciding when a
> combiner is invoked. The way your query is written now, it would not
> be; however, if you modify your FOREACH statement as follows, it will be:
>
> RollUp = FOREACH HourGroups GENERATE FLATTEN(group), COUNT(Raw);
>
> You can see if the combiner is invoked by running
>
> Explain RollUp.
>
> (2) You do need to use the PARALLEL keyword on the GROUP operator to
> make sure it runs in parallel.
>
> Finally, we are working on some performance improvements as part of
> the pipeline redesign. You can track the progress at
> https://issues.apache.org/jira/browse/pig-157.
>
> Olga
>
> > -----Original Message-----
> > From: Travis Brady [EMAIL PROTECTED]
> > Sent: Wednesday, March 26, 2008 2:03 PM
> > To: pig-user@incubator.apache.org
> > Subject: Pig performance
> >
> > I really like writing Pig code, but I'm experiencing pretty
> > terrible performance using Pig: a simple data rollup takes
> > about 90 minutes to complete.  The equivalent job, expressed
> > using shell scripts and Haskell and executed with Hadoop
> > streaming, runs in roughly 5 minutes.
> > My dataset is stored on HDFS as a handful of tab-delimited
> > text files.  In total there are 19 million rows of data.
> >
> > This is running on a 3-node cluster where each machine has
> > 8GB of RAM.  I have all three machines configured per the
> > instructions on the Hadoop wiki on setting up Hadoop on Ubuntu.
> >
> > Here is the pig code:
> > <code>
> > Raw = LOAD 'stats_dump_200707' USING PigStorage('\t');
> >
> > HourGroups = GROUP Raw by $0;
> >
> > RollUp = FOREACH HourGroups {
> >     GENERATE FLATTEN(group), COUNT(Raw); }
> >
> > DUMP RollUp;
> > </code>
> >
> > Do I need to add the PARALLEL keyword in there somewhere?
> > Change something in hadoop-site.xml?
> >
> > The Hadoop streaming version uses "cut -c 1-13" as the mapper
> > and a bit of Haskell compiled with ghc as the reducer.
> > I can send the Haskell code along if it would help, but for
> > now I assume I must be doing something wrong for Pig to
> > perform so poorly.
> >
> > thank you
> >
> > --
> > Travis Brady
> > www.mochiads.com
> >
>



-- 
Travis Brady
www.mochiads.com