pig-user  

Re: Pig performance

pi song
Wed, 26 Mar 2008 18:58:44 -0700

This is obviously a memory management problem that I'm investigating.
Travis, when did you download Pig? After the PIG-18 commit, I find this
problem occurs less often. Can you get the latest version and try again?

Also, a few days ago we identified an issue in the memory manager, though
Ben hasn't come up with a patch yet. If the above still doesn't work,
could you please try this out?

1.  Look at SpillableMemoryManager.java
2.  Change the code that looks like this
"
if (o1Size == o2Size) {
    return 0;
}
if (o1Size < o2Size) {
    return -1;
}
return 1;
}
"

to

"
if (o1Size == o2Size) {
    return 0;
}
if (o1Size < o2Size) {
    return 1;
}
return -1;
}
"

(Basically just swap 1 with -1)
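
To illustrate what the swap does, here is a minimal sketch. It assumes the comparison is used to sort objects by their memory size, which the snippet suggests but does not show; the class name, the two comparator constants, and the sample sizes are hypothetical and are not Pig's actual code. Under that assumption, swapping 1 and -1 flips the sort from smallest-first to largest-first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the comparison in SpillableMemoryManager.java;
// sizes are plain Longs here rather than real spillable objects.
public class SpillOrderSketch {

    // Comparison as it looks before the change: smaller sizes sort first.
    static final Comparator<Long> ASCENDING = new Comparator<Long>() {
        @Override
        public int compare(Long o1Size, Long o2Size) {
            if (o1Size.equals(o2Size)) {
                return 0;
            }
            if (o1Size < o2Size) {
                return -1;
            }
            return 1;
        }
    };

    // Comparison after swapping 1 and -1: larger sizes sort first.
    static final Comparator<Long> DESCENDING = new Comparator<Long>() {
        @Override
        public int compare(Long o1Size, Long o2Size) {
            if (o1Size.equals(o2Size)) {
                return 0;
            }
            if (o1Size < o2Size) {
                return 1;
            }
            return -1;
        }
    };

    // Returns a sorted copy so the input list is left untouched.
    static List<Long> sorted(List<Long> sizes, Comparator<Long> order) {
        List<Long> copy = new ArrayList<Long>(sizes);
        Collections.sort(copy, order);
        return copy;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(40L, 900L, 5L);
        System.out.println(sorted(sizes, ASCENDING));  // [5, 40, 900]
        System.out.println(sorted(sizes, DESCENDING)); // [900, 40, 5]
    }
}
```

If the sorted list is then walked from the front, the second ordering reaches the largest objects first.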
Hope that helps.

Pi


On 3/27/08, Travis Brady <[EMAIL PROTECTED]> wrote:
>
> Hi Olga,
>
> Thanks for your help and thank you for open sourcing Pig.
> I converted the FOREACH statement, but it yields a very lengthy error, the
> relevant portion of which I've pasted below.
>
> The good news is the modified FOREACH was going to invoke the combiner.
>
> 2008-03-26 16:16:00,322 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Error message from task (map) tip_200803261041_0008_m_000000
> 2008-03-26 16:16:00,322 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Error message from task (map) tip_200803261041_0008_m_000002
> java.lang.OutOfMemoryError: Java heap space
>    at java.util.Arrays.copyOf(Arrays.java:2786)
>    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:71)
>    at java.io.DataOutputStream.write(DataOutputStream.java:71)
>    at org.apache.pig.data.Tuple.encodeInt(Tuple.java:362)
>    at org.apache.pig.data.DataAtom.write(DataAtom.java:137)
>    at org.apache.pig.data.Tuple.write(Tuple.java:301)
>    at org.apache.pig.data.IndexedTuple.write(IndexedTuple.java:52)
>    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:392)
>    at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce$MapDataOutputCollector.add(PigMapReduce.java:304)
>    at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
>    at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.add(GenerateSpec.java:230)
>    at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
>    at org.apache.pig.impl.eval.collector.DataCollector.addToSuccessor(DataCollector.java:93)
>    at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:35)
>    at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.exec(GenerateSpec.java:261)
>    at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:86)
>    at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
>    at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:113)
>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
>    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084)
>
>
> Travis
>
>
> On Wed, Mar 26, 2008 at 3:51 PM, Olga Natkovich <[EMAIL PROTECTED]>
> wrote:
>
> > Hi Travis,
> >
> > There are a couple of things you can do to improve performance of your
> > script.
> >
> > (1) At this point we have pretty basic logic for deciding when a
> > combiner is invoked. The way your query is written now, it would not
> > be; however, if you modify your FOREACH statement as follows, it will be:
> >
> > RollUp = FOREACH HourGroups GENERATE FLATTEN(group), COUNT(Raw);
> >
> > You can see if the combiner is invoked by running
> >
> > EXPLAIN RollUp;
> >
> > (2) You do need to use the PARALLEL keyword on the GROUP operator to
> > make sure it runs in parallel.
> >
> > Finally, we are working on some performance improvements as part of
> > the pipeline redesign. You can track the progress at
> > https://issues.apache.org/jira/browse/pig-157.
> >
> > Olga
> >
> > > -----Original Message-----
> > > From: Travis Brady [EMAIL PROTECTED]
> > > Sent: Wednesday, March 26, 2008 2:03 PM
> > > To: pig-user@incubator.apache.org
> > > Subject: Pig performance
> > >
> > > I really like writing Pig code, but I'm experiencing pretty
> > > terrible performance using Pig for a simple data rollup:
> > > it takes about 90 minutes to complete.  The equivalent,
> > > expressed using shell scripts and Haskell and executed with
> > > Hadoop streaming, runs in roughly 5 minutes.
> > > My dataset is stored on hdfs as a handful of tab delimited
> > > text files.  In sum there are 19 million rows of data.
> > >
> > > This is running on a 3-node cluster where each machine has
> > > 8GB of ram.  I have all three machines configured per the
> > > instructions on the Hadoop wiki on setting up Hadoop on Ubuntu.
> > >
> > > Here is the pig code:
> > > <code>
> > > Raw = LOAD 'stats_dump_200707' USING PigStorage('\t');
> > >
> > > HourGroups = GROUP Raw by $0;
> > >
> > > RollUp = FOREACH HourGroups {
> > >     GENERATE FLATTEN(group), COUNT(Raw); }
> > >
> > > DUMP RollUp;
> > > </code>
> > >
> > > Do I need to add the PARALLEL keyword in there somewhere?
> > > Change something in hadoop-site.xml?
> > >
> > > The Hadoop streaming stuff uses "cut -c 1-13" as the mapper
> > > and a bit of Haskell compiled with ghc as the reducer.
> > > I can send the Haskell code along if it would help, but for
> > > now I assume I must be doing something wrong for it to
> > > perform so poorly.
> > >
> > > thank you
> > >
> > > --
> > > Travis Brady
> > > www.mochiads.com
> > >
> >
>
>
>
> --
> Travis Brady
> www.mochiads.com
>