Arun C Murthy
Thu, 27 Mar 2008 09:23:35 -0700
On Mar 26, 2008, at 7:03 PM, pi song wrote:
Olga, Is there any task profiling facility in Hadoop that we can use?
With hadoop-0.17 there is a feature wherein the user can ask for a small subset of map and reduce tasks to be profiled by the built-in Java profiler. That's hadoop-0.17 though...
Other than that, I've attached profilers manually to tasks and looked at them; tedious, but doable.
Arun
Pi

On 3/27/08, pi song <[EMAIL PROTECTED]> wrote:

This is obviously a memory management problem that I'm investigating.

Travis, when did you download Pig? After the PIG-18 commit, I find this problem occurs less often. Can you get the latest version and try again?

Also, a few days ago we identified an issue in the memory manager, though Ben hasn't come up with a patch yet. If the above still doesn't work, could you please try this out?

1. Look at SpillableMemoryManager.java
2. Change the code that looks like this:

    if (o1Size == o2Size) { return 0; }
    if (o1Size < o2Size) { return -1; }
    return 1;
    }

   to:

    if (o1Size == o2Size) { return 0; }
    if (o1Size < o2Size) { return 1; }
    return -1;
    }

(Basically just swap 1 with -1)

Hope that will help

Pi

On 3/27/08, Travis Brady <[EMAIL PROTECTED]> wrote:

Hi Olga,

Thanks for your help and thank you for open sourcing Pig.

I converted the FOREACH statement but it yields a very lengthy error, the relevant portion of which I've pasted below. The good news is the modified FOREACH was going to invoke the combiner.

2008-03-26 16:16:00,322 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Error message from task (map) tip_200803261041_0008_m_000000
2008-03-26 16:16:00,322 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Error message from task (map) tip_200803261041_0008_m_000002
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2786)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:71)
    at java.io.DataOutputStream.write(DataOutputStream.java:71)
    at org.apache.pig.data.Tuple.encodeInt(Tuple.java:362)
    at org.apache.pig.data.DataAtom.write(DataAtom.java:137)
    at org.apache.pig.data.Tuple.write(Tuple.java:301)
    at org.apache.pig.data.IndexedTuple.write(IndexedTuple.java:52)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:392)
    at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce$MapDataOutputCollector.add(PigMapReduce.java:304)
    at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
    at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.add(GenerateSpec.java:230)
    at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
    at org.apache.pig.impl.eval.collector.DataCollector.addToSuccessor(DataCollector.java:93)
    at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:35)
    at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.exec(GenerateSpec.java:261)
    at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:86)
    at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
    at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:113)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084)

Travis

On Wed, Mar 26, 2008 at 3:51 PM, Olga Natkovich <[EMAIL PROTECTED] inc.com> wrote:

Hi Travis,

There are a couple of things you can do to improve the performance of your script.

(1) At this point we have pretty basic logic for deciding when a combiner is invoked. The way your query is written now, it would not be; however, if you modify your FOREACH statement as follows, it will be:

    RollUp = FOREACH HourGroups GENERATE FLATTEN(group), COUNT(Raw);

You can see if the combiner is invoked by running EXPLAIN RollUp.

(2) You do need to use the PARALLEL keyword on the GROUP operator to make sure it runs in parallel.

Finally, we are working on some performance improvements as part of the pipeline redesign. You can track the progress at https://issues.apache.org/jira/browse/pig-157.
Olga

-----Original Message-----
From: Travis Brady [EMAIL PROTECTED]
Sent: Wednesday, March 26, 2008 2:03 PM
To: pig-user@incubator.apache.org
Subject: Pig performance

I really like writing Pig code, but I'm experiencing pretty terrible performance using Pig for a simple data rollup, which takes about 90 minutes to complete. The equivalent expressed using shell scripts and Haskell and executed with Hadoop streaming runs in roughly 5 minutes.

My dataset is stored on HDFS as a handful of tab-delimited text files. In sum there are 19 million rows of data. This is running on a 3-node cluster where each machine has 8GB of RAM. I have all three machines configured per the instructions on the Hadoop wiki on setting up Hadoop on Ubuntu.

Here is the Pig code:

<code>
Raw = LOAD 'stats_dump_200707' USING PigStorage('\t');
HourGroups = GROUP Raw by $0;
RollUp = FOREACH HourGroups { GENERATE FLATTEN(group), COUNT(Raw); }
DUMP RollUp;
</code>

Do I need to add the PARALLEL keyword in there somewhere? Change something in hadoop-site.xml?

The Hadoop streaming stuff uses "cut -c 1-13" as the mapper and a bit of Haskell compiled with ghc as the reducer. I can send the Haskell code along if it would help, but for now I assume I must be doing something wrong for it to perform so poorly.

thank you

--
Travis Brady
www.mochiads.com
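
A note on Pi's suggestion about SpillableMemoryManager.java, quoted above: swapping 1 and -1 in a compare method reverses the sort order, so the largest size comes first instead of last. The following is a minimal, self-contained Java sketch of that effect only. The class name SpillOrderSketch, the helper names, and the sample sizes are hypothetical, and this is not Pig's actual code. Why larger entries should come first (presumably so they are considered for spilling earlier) is an assumption, not something the thread states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in used only to illustrate the comparator swap;
// it does not reproduce Pig's SpillableMemoryManager.
class SpillOrderSketch {

    // Comparison as quoted before the change: smaller sizes sort first.
    static int compareBefore(long o1Size, long o2Size) {
        if (o1Size == o2Size) { return 0; }
        if (o1Size < o2Size) { return -1; }
        return 1;
    }

    // Comparison after swapping 1 and -1: larger sizes sort first.
    static int compareAfter(long o1Size, long o2Size) {
        if (o1Size == o2Size) { return 0; }
        if (o1Size < o2Size) { return 1; }
        return -1;
    }

    static final Comparator<Long> BEFORE = SpillOrderSketch::compareBefore;
    static final Comparator<Long> AFTER = SpillOrderSketch::compareAfter;

    // Returns a sorted copy so the input list is left untouched.
    static List<Long> sorted(List<Long> sizes, Comparator<Long> order) {
        List<Long> copy = new ArrayList<>(sizes);
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        // Made-up sizes, purely for illustration.
        List<Long> sizes = Arrays.asList(40L, 5L, 300L, 40L);
        System.out.println(sorted(sizes, BEFORE)); // [5, 40, 40, 300]
        System.out.println(sorted(sizes, AFTER));  // [300, 40, 40, 5]
    }
}
```

With the original return values the list comes out in ascending order; with the swapped values it comes out in descending order, which is the whole effect of the one-line change.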