pig-user  

Re: Pig performance

pi song
Wed, 26 Mar 2008 19:04:33 -0700

Olga,

Is there any task profiling facility in Hadoop that we can use?

Pi

On 3/27/08, pi song <[EMAIL PROTECTED]> wrote:
>
> This is obviously a memory management problem that I'm investigating.
> Travis, when did you download Pig? After the PIG-18 commit, I find this
> problem occurs less often. Can you get the latest version and try again?
>
> Also, a few days ago we identified an issue in the memory manager, though
> Ben hasn't come up with a patch yet. If the above still doesn't work,
> could you please try this out?
>
> 1.  Look at SpillableMemoryManager.java
> 2.  change the code that looks like this
> "
> if (o1Size == o2Size) {
>     return 0;
> }
> if (o1Size < o2Size) {
>     return -1;
> }
> return 1;
> }
> "
>
> to
>
> "
> if (o1Size == o2Size) {
>     return 0;
> }
> if (o1Size < o2Size) {
>     return 1;
> }
> return -1;
> }
> "
>
> (Basically just swap 1 with -1)
> Hope that will help
>
> Pi
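
A minimal, self-contained sketch of what the swap above does to sort order. It assumes the comparison is used to sort objects by size; the class `Sized`, the field names, and the sample values are hypothetical stand-ins and are not taken from Pig's SpillableMemoryManager. With 1 and -1 swapped, larger objects sort first instead of last.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class DescendingSizeSketch {
    // Hypothetical stand-in for whatever the memory manager sorts; not a Pig type.
    static final class Sized {
        final String name;
        final long size;

        Sized(String name, long size) {
            this.name = name;
            this.size = size;
        }
    }

    // Same shape as the swapped comparison quoted above: larger sizes sort first.
    static final Comparator<Sized> LARGEST_FIRST = new Comparator<Sized>() {
        @Override
        public int compare(Sized o1, Sized o2) {
            long o1Size = o1.size;
            long o2Size = o2.size;
            if (o1Size == o2Size) {
                return 0;
            }
            if (o1Size < o2Size) {
                return 1;
            }
            return -1;
        }
    };

    // Returns the names of the items ordered from largest to smallest size.
    static List<String> namesLargestFirst(List<Sized> items) {
        List<Sized> copy = new ArrayList<>(items);
        copy.sort(LARGEST_FIRST);
        List<String> names = new ArrayList<>();
        for (Sized s : copy) {
            names.add(s.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Sized> items = Arrays.asList(
                new Sized("small", 10L),
                new Sized("big", 1000L),
                new Sized("medium", 100L));
        System.out.println(namesLargestFirst(items)); // prints [big, medium, small]
    }
}
```

With the original return values (-1 when `o1Size < o2Size`), the same list would come out smallest first.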
>
>
> On 3/27/08, Travis Brady <[EMAIL PROTECTED]> wrote:
> >
> > Hi Olga,
> >
> > Thanks for your help and thank you for open sourcing Pig.
> > I converted the FOREACH statement, but it yields a very lengthy error, the
> > relevant portion of which I've pasted below.
> >
> > The good news is the modified FOREACH was going to invoke the combiner.
> >
> > 2008-03-26 16:16:00,322 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Error message from task (map) tip_200803261041_0008_m_000000
> > 2008-03-26 16:16:00,322 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Error message from task (map) tip_200803261041_0008_m_000002
> > java.lang.OutOfMemoryError: Java heap space
> >    at java.util.Arrays.copyOf(Arrays.java:2786)
> >    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:71)
> >    at java.io.DataOutputStream.write(DataOutputStream.java:71)
> >    at org.apache.pig.data.Tuple.encodeInt(Tuple.java:362)
> >    at org.apache.pig.data.DataAtom.write(DataAtom.java:137)
> >    at org.apache.pig.data.Tuple.write(Tuple.java:301)
> >    at org.apache.pig.data.IndexedTuple.write(IndexedTuple.java:52)
> >    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:392)
> >    at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce$MapDataOutputCollector.add(PigMapReduce.java:304)
> >    at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
> >    at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.add(GenerateSpec.java:230)
> >    at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
> >    at org.apache.pig.impl.eval.collector.DataCollector.addToSuccessor(DataCollector.java:93)
> >    at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:35)
> >    at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.exec(GenerateSpec.java:261)
> >    at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:86)
> >    at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
> >    at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:113)
> >    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
> >    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084)
> >
> >
> > Travis
> >
> >
> > On Wed, Mar 26, 2008 at 3:51 PM, Olga Natkovich <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Hi Travis,
> > >
> > > There are a couple of things you can do to improve performance of your
> > > script.
> > >
> > > (1) At this point we have fairly basic logic for deciding when a
> > > combiner is invoked. As your query is written now, it would not be;
> > > however, if you modify your FOREACH statement as follows, it will be:
> > >
> > > RollUp = FOREACH HourGroups GENERATE FLATTEN(group), COUNT(Raw);
> > >
> > > You can see if the combiner is invoked by running
> > >
> > > Explain RollUp.
> > >
> > > (2) You do need to use the PARALLEL keyword on the GROUP operator to
> > > make sure it runs in parallel.
> > >
> > > Finally, we are working on some performance improvements as part of
> > > the pipeline redesign. You can track the progress at
> > > https://issues.apache.org/jira/browse/pig-157.
> > >
> > > Olga
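
For readers unfamiliar with why invoking the combiner matters for a COUNT rollup, here is a rough sketch of the idea in plain Java. It is not Pig's or Hadoop's implementation; the class, the method names, and the sample keys are all hypothetical. Each map-side split emits one partial count per key rather than one record per row, and the reduce side only sums those partial counts.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerCountSketch {
    // Map-side step: count rows per key within a single split.
    // This is the kind of partial result a combiner would emit.
    static Map<String, Long> partialCounts(List<String> keys) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String key : keys) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    // Reduce-side step: sum the partial counts from all splits.
    static Map<String, Long> mergeCounts(List<Map<String, Long>> partials) {
        Map<String, Long> total = new LinkedHashMap<>();
        for (Map<String, Long> partial : partials) {
            for (Map.Entry<String, Long> e : partial.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up hour keys, for illustration only.
        Map<String, Long> split1 = partialCounts(
                Arrays.asList("2007-07-01 00", "2007-07-01 00", "2007-07-01 01"));
        Map<String, Long> split2 = partialCounts(
                Arrays.asList("2007-07-01 00", "2007-07-01 01"));
        // prints {2007-07-01 00=3, 2007-07-01 01=2}
        System.out.println(mergeCounts(Arrays.asList(split1, split2)));
    }
}
```

Because counting is associative, summing partial counts gives the same result as counting every row at the reducer, while far less data has to be buffered and shuffled.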
> > >
> > > > -----Original Message-----
> > > > From: Travis Brady [EMAIL PROTECTED]
> > > > Sent: Wednesday, March 26, 2008 2:03 PM
> > > > To: pig-user@incubator.apache.org
> > > > Subject: Pig performance
> > > >
> > > > > I really like writing Pig code, but I'm experiencing pretty
> > > > > terrible performance: a simple data rollup takes about
> > > > > 90 minutes to complete.  The equivalent, expressed using
> > > > > shell scripts and Haskell and executed with Hadoop
> > > > > streaming, runs in roughly 5 minutes.
> > > > > My dataset is stored on HDFS as a handful of tab-delimited
> > > > > text files.  In sum there are 19 million rows of data.
> > > >
> > > > > This is running on a 3-node cluster where each machine has
> > > > > 8 GB of RAM.  I have all three machines configured per the
> > > > > instructions on the Hadoop wiki on setting up Hadoop on Ubuntu.
> > > >
> > > > Here is the pig code:
> > > > <code>
> > > > Raw = LOAD 'stats_dump_200707' USING PigStorage('\t');
> > > >
> > > > HourGroups = GROUP Raw by $0;
> > > >
> > > > RollUp = FOREACH HourGroups {
> > > >     GENERATE FLATTEN(group), COUNT(Raw); }
> > > >
> > > > DUMP RollUp;
> > > > </code>
> > > >
> > > > Do I need to add the PARALLEL keyword in there somewhere?
> > > > Change something in hadoop-site.xml?
> > > >
> > > > > The Hadoop streaming version uses "cut -c 1-13" as the mapper
> > > > > and a bit of Haskell compiled with ghc as the reducer.
> > > > > I can send the Haskell code along if it would help, but for
> > > > > now I assume I must be doing something wrong for Pig to
> > > > > perform so poorly.
> > > >
> > > > thank you
> > > >
> > > > --
> > > > Travis Brady
> > > > www.mochiads.com
> > > >
> > >
> >
> >
> >
> > --
> > Travis Brady
> > www.mochiads.com
> >
>
>