Travis Brady
Wed, 26 Mar 2008 17:28:08 -0700
Hi Olga,
Thanks for your help and thank you for open sourcing Pig.
I converted the FOREACH statement, but it yields a very lengthy error, the
relevant portion of which I've pasted below.
The good news is the modified FOREACH was going to invoke the combiner.
2008-03-26 16:16:00,322 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Error message from task (map) tip_200803261041_0008_m_000000
2008-03-26 16:16:00,322 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Error message from task (map) tip_200803261041_0008_m_000002
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2786)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:71)
at java.io.DataOutputStream.write(DataOutputStream.java:71)
at org.apache.pig.data.Tuple.encodeInt(Tuple.java:362)
at org.apache.pig.data.DataAtom.write(DataAtom.java:137)
at org.apache.pig.data.Tuple.write(Tuple.java:301)
at org.apache.pig.data.IndexedTuple.write(IndexedTuple.java:52)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:392)
at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce$MapDataOutputCollector.add(PigMapReduce.java:304)
at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.add(GenerateSpec.java:230)
at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
at org.apache.pig.impl.eval.collector.DataCollector.addToSuccessor(DataCollector.java:93)
at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:35)
at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.exec(GenerateSpec.java:261)
at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:86)
at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:113)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084)
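
For readers following the thread, here is a minimal Python sketch (not Pig's or Hadoop's actual implementation) of what a combiner does for a COUNT rollup like this one: each map split pre-aggregates its own rows, so one partial count per key is passed on to the reducer instead of one record per input row. The function names and sample rows are hypothetical, and the key is assumed to be the first tab-delimited field, as with $0 in the script.

```python
from collections import Counter


def map_with_combiner(lines):
    """Count rows per key within a single map split (the combiner's role)."""
    partial = Counter()
    for line in lines:
        key = line.split("\t", 1)[0]  # first tab-delimited field, like $0
        partial[key] += 1
    return partial


def reduce_counts(partials):
    """Sum the per-split partial counts into the final rollup."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


# Two hypothetical map splits; these rows are made up for illustration.
split_a = ["2007-07-01 00\tx", "2007-07-01 00\ty", "2007-07-01 01\tz"]
split_b = ["2007-07-01 00\tw", "2007-07-01 01\tv"]

rollup = reduce_counts([map_with_combiner(split_a), map_with_combiner(split_b)])
for hour, count in sorted(rollup.items()):
    print(hour, count)
```

In this toy run the five input rows shrink to four partial counts before the reduce step; on real data with few distinct keys the reduction would be far larger.
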
Travis
On Wed, Mar 26, 2008 at 3:51 PM, Olga Natkovich <[EMAIL PROTECTED]> wrote:
> Hi Travis,
>
> There are a couple of things you can do to improve performance of your
> script.
>
> (1) At this point we have pretty basic logic for deciding when a combiner
> is invoked. The way your query is written now, it would not be; however,
> if you modify your FOREACH statement as follows, it will be:
>
> RollUp = FOREACH HourGroups GENERATE FLATTEN(group), COUNT(Raw);
>
> You can see if the combiner is invoked by running
>
> EXPLAIN RollUp;
>
> (2) You do need to use the PARALLEL keyword on the GROUP operator to make
> sure it runs in parallel.
>
> Finally, we are working on some performance improvements as part of
> the pipeline redesign. You can track the progress at
> https://issues.apache.org/jira/browse/pig-157.
>
> Olga
>
> > -----Original Message-----
> > From: Travis Brady [EMAIL PROTECTED]
> > Sent: Wednesday, March 26, 2008 2:03 PM
> > To: pig-user@incubator.apache.org
> > Subject: Pig performance
> >
> > I really like writing Pig code, but I'm experiencing pretty
> > terrible performance using Pig for a simple data rollup,
> > which takes about 90 minutes to complete. The equivalent,
> > expressed using shell scripts and Haskell and executed with
> > Hadoop streaming, runs in roughly 5 minutes.
> > My dataset is stored on HDFS as a handful of tab-delimited
> > text files. In sum there are 19 million rows of data.
> >
> > This is running on a 3-node cluster where each machine has
> > 8 GB of RAM. I have all three machines configured per the
> > instructions on the Hadoop wiki on setting up Hadoop on Ubuntu.
> >
> > Here is the pig code:
> > <code>
> > Raw = LOAD 'stats_dump_200707' USING PigStorage('\t');
> >
> > HourGroups = GROUP Raw by $0;
> >
> > RollUp = FOREACH HourGroups {
> >     GENERATE FLATTEN(group), COUNT(Raw);
> > }
> >
> > DUMP RollUp;
> > </code>
> >
> > Do I need to add the PARALLEL keyword in there somewhere?
> > Change something in hadoop-site.xml?
> >
> > The Hadoop streaming stuff uses "cut -c 1-13" as the mapper
> > and a bit of Haskell compiled with GHC as the reducer.
> > I can send the Haskell code along if it would help, but for
> > now I assume I must be doing something wrong for it to
> > perform so poorly.
> >
> > thank you
> >
> > --
> > Travis Brady
> > www.mochiads.com
> >
>
--
Travis Brady
www.mochiads.com