Travis Brady
Thu, 27 Mar 2008 16:25:27 -0700
Hi Pi,

I'm thinking this could be related to my hadoop-site.xml settings.
Specifically mapred.child.java.opts, which I'd had set at 200m (the
default), but after reading this thread:
http://www.mail-archive.com/pig-user@incubator.apache.org/msg00038.html
I bumped mine up to 3060m and I'm not getting the memory errors anymore.

What about mapred.map.tasks and mapred.reduce.tasks? Any thoughts on what
those should be set to?

I svn up'd pig a few hours ago so my code should be all up to date.

I'm glad the memory error is gone, but is there anything else I can do to
improve performance?

thanks,
Travis

On Wed, Mar 26, 2008 at 6:58 PM, pi song <[EMAIL PROTECTED]> wrote:

> This is obviously a memory management problem that I'm investigating.
> Travis, when did you download Pig? After the PIG-18 commit, I find this
> problem occurs less often. Can you get the latest version and try again?
>
> Also, a few days ago, we have identified an issue in the memory manager.
> Though Ben hasn't come up with a patch yet. If the above still doesn't
> work, could you please try this out?
>
> 1. Look at SpillableMemoryManager.java
> 2. change the code that looks like this
> "
>     if (o1Size == o2Size) {
>         return 0;
>     }
>     if (o1Size < o2Size) {
>         return -1;
>     }
>     return 1;
> }
> "
>
> to
>
> "
>     if (o1Size == o2Size) {
>         return 0;
>     }
>     if (o1Size < o2Size) {
>         return 1;
>     }
>     return -1;
> }
> "
>
> (Basically just swap 1 with -1)
> Hope that will help
>
> Pi
>
>
> On 3/27/08, Travis Brady <[EMAIL PROTECTED]> wrote:
> >
> > Hi Olga,
> >
> > Thanks for your help and thank you for open sourcing Pig.
> > I converted the FOREACH statement but it yields a very lengthy error the
> > relevant portion of which I've pasted below.
> >
> > The good news is the modified FOREACH was going to invoke the combiner.
> >
> > 2008-03-26 16:16:00,322 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Error message from task (map) tip_200803261041_0008_m_000000
> > 2008-03-26 16:16:00,322 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Error message from task (map) tip_200803261041_0008_m_000002
> > java.lang.OutOfMemoryError: Java heap space
> >     at java.util.Arrays.copyOf(Arrays.java:2786)
> >     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:71)
> >     at java.io.DataOutputStream.write(DataOutputStream.java:71)
> >     at org.apache.pig.data.Tuple.encodeInt(Tuple.java:362)
> >     at org.apache.pig.data.DataAtom.write(DataAtom.java:137)
> >     at org.apache.pig.data.Tuple.write(Tuple.java:301)
> >     at org.apache.pig.data.IndexedTuple.write(IndexedTuple.java:52)
> >     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:392)
> >     at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce$MapDataOutputCollector.add(PigMapReduce.java:304)
> >     at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
> >     at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.add(GenerateSpec.java:230)
> >     at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
> >     at org.apache.pig.impl.eval.collector.DataCollector.addToSuccessor(DataCollector.java:93)
> >     at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:35)
> >     at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.exec(GenerateSpec.java:261)
> >     at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:86)
> >     at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
> >     at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:113)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
> >     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084)
> >
> >
> > Travis
> >
> >
> > On Wed, Mar 26, 2008 at 3:51 PM, Olga Natkovich <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Hi Travis,
> > >
> > > There are a couple of things you can do to improve performance of your
> > > script.
> > >
> > > (1) At this point we have a pretty basic logic of when a combiner is
> > > invoked. In the way your query is written now it would not be, however,
> > > if you modify your foreach statement it will be:
> > >
> > > RollUp = FOREACH HourGroups GENERATE FLATTEN(group), COUNT(Raw);
> > >
> > > You can see if the combiner is invoked by running
> > >
> > > Explain RollUp.
> > >
> > > (2) You do need to use parallel keyword on the group operator to make
> > > sure it runs in parallel.
> > >
> > > Finally, we are working on some performance improvements as part of
> > > pipeline redesign. You can track the progress at
> > > https://issues.apache.org/jira/browse/pig-157.
> > >
> > > Olga
> > >
> > > > -----Original Message-----
> > > > From: Travis Brady [EMAIL PROTECTED]
> > > > Sent: Wednesday, March 26, 2008 2:03 PM
> > > > To: pig-user@incubator.apache.org
> > > > Subject: Pig performance
> > > >
> > > > I really like writing pig code, but I'm experiencing pretty
> > > > terrible performance using Pig for a simple data rollup
> > > > taking about 90 minutes to complete. The equivalent
> > > > expressed using shell scripts and Haskell and executed with
> > > > hadoop streaming runs in roughly 5 minutes.
> > > > My dataset is stored on hdfs as a handful of tab delimited
> > > > text files. In sum there are 19 million rows of data.
> > > >
> > > > This is running on a 3-node cluster where each machine has
> > > > 8GB of ram. I have all three machines configured per the
> > > > instructions on the Hadoop wiki on setting up Hadoop on Ubuntu.
> > > >
> > > > Here is the pig code:
> > > > <code>
> > > > Raw = LOAD 'stats_dump_200707' USING PigStorage('\t');
> > > >
> > > > HourGroups = GROUP Raw by $0;
> > > >
> > > > RollUp = FOREACH HourGroups {
> > > >     GENERATE FLATTEN(group), COUNT(Raw); }
> > > >
> > > > DUMP RollUp;
> > > > </code>
> > > >
> > > > Do I need to add the PARALLEL keyword in there somewhere?
> > > > Change something in hadoop-site.xml?
> > > >
> > > > The Hadoop streaming stuff uses "cut -c 1-13" as the mapper
> > > > and a bit of Haskell compiled with ghc as the reducer:
> > > > I can send the Haskell code along if it would help, but for
> > > > now I assume I must be doing something wrong for it to
> > > > perform so poorly.
> > > >
> > > > thank you
> > > >
> > > > --
> > > > Travis Brady
> > > > www.mochiads.com
> > > >
> > >
> >
> >
> > --
> > Travis Brady
> > www.mochiads.com
>

--
Travis Brady
www.mochiads.com
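
For reference, the comparator change Pi describes in the quoted message (swap 1 with -1) reverses the sort order, so that entries with larger sizes sort first. The following is only a minimal sketch of that effect, not the actual Pig source: SizedItem, LARGEST_FIRST, and sortedNames are hypothetical names invented for illustration, since the thread does not show the surrounding class or what SpillableMemoryManager really compares.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class DescendingSizeSketch {

    // Hypothetical stand-in for whatever object the real comparator measures.
    public static final class SizedItem {
        final String name;
        final long size;

        public SizedItem(String name, long size) {
            this.name = name;
            this.size = size;
        }
    }

    // The swapped form from the thread: returning 1 when o1 is smaller
    // places larger items first (descending order by size).
    public static final Comparator<SizedItem> LARGEST_FIRST = new Comparator<SizedItem>() {
        @Override
        public int compare(SizedItem o1, SizedItem o2) {
            long o1Size = o1.size;
            long o2Size = o2.size;
            if (o1Size == o2Size) {
                return 0;
            }
            if (o1Size < o2Size) {
                return 1;
            }
            return -1;
        }
    };

    // Sorts a copy of the input with the swapped comparator and returns the names in order.
    public static List<String> sortedNames(List<SizedItem> items) {
        List<SizedItem> copy = new ArrayList<SizedItem>(items);
        Collections.sort(copy, LARGEST_FIRST);
        List<String> names = new ArrayList<String>();
        for (SizedItem item : copy) {
            names.add(item.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<SizedItem> items = new ArrayList<SizedItem>();
        items.add(new SizedItem("small", 10L));
        items.add(new SizedItem("large", 3000L));
        items.add(new SizedItem("medium", 200L));
        System.out.println(sortedNames(items));
    }
}
```

With the original (unswapped) return values the same list would sort smallest first. Why largest-first matters to the memory manager is not stated in the thread, so the sketch makes no claim about that.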