pig-user  

Re: Pig performance

Arun C Murthy
Thu, 27 Mar 2008 09:23:35 -0700


On Mar 26, 2008, at 7:03 PM, pi song wrote:

Olga,

Is there any task profiling facility in Hadoop that we can use?


With hadoop-0.17 there is a feature wherein the user can ask for a small subset of map and reduce tasks to be profiled by the built-in Java profiler. That's hadoop-0.17 though...

Other than that, I've attached profilers manually to tasks and looked at them; tedious but doable.

Arun

Pi

On 3/27/08, pi song <[EMAIL PROTECTED]> wrote:

This is obviously a memory management problem that I'm investigating.
Travis, when did you download Pig? After the PIG-18 commit, I find this problem occurs less often. Can you get the latest version and try again?

Also, a few days ago we identified an issue in the memory manager, though Ben hasn't come up with a patch yet. If the above still doesn't work,
could you please try this out?

1.  Look at SpillableMemoryManager.java
2.  Change the code that looks like this
"
if (o1Size == o2Size) {
    return 0;
}
if (o1Size < o2Size) {
    return -1;
}
return 1;
}
"

to

"
if (o1Size == o2Size) {
    return 0;
}
if (o1Size < o2Size) {
    return 1;
}
return -1;
}
"

(Basically just swap 1 with -1)
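
To see what the swap does in isolation, here is a minimal standalone sketch. It is not the actual SpillableMemoryManager code: the class name, the sample sizes, and the use of plain Long values are hypothetical, and it only assumes that the comparator is used to sort objects by their memory size. With the original return values the smallest size sorts first; with the swapped values the largest sorts first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the comparator in SpillableMemoryManager.java.
public class SpillOrderSketch {

    // Original return values: smaller sizes sort first (ascending).
    static final Comparator<Long> ORIGINAL = new Comparator<Long>() {
        @Override
        public int compare(Long o1Size, Long o2Size) {
            if (o1Size.longValue() == o2Size.longValue()) {
                return 0;
            }
            if (o1Size < o2Size) {
                return -1;
            }
            return 1;
        }
    };

    // Swapped return values: larger sizes sort first (descending).
    static final Comparator<Long> SWAPPED = new Comparator<Long>() {
        @Override
        public int compare(Long o1Size, Long o2Size) {
            if (o1Size.longValue() == o2Size.longValue()) {
                return 0;
            }
            if (o1Size < o2Size) {
                return 1;
            }
            return -1;
        }
    };

    // Returns a sorted copy so the caller's list is left untouched.
    static List<Long> sorted(List<Long> sizes, Comparator<Long> order) {
        List<Long> copy = new ArrayList<Long>(sizes);
        Collections.sort(copy, order);
        return copy;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(10L, 300L, 20L); // made-up sizes
        System.out.println(sorted(sizes, ORIGINAL)); // [10, 20, 300]
        System.out.println(sorted(sizes, SWAPPED));  // [300, 20, 10]
    }
}
```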
Hope that will help

Pi


On 3/27/08, Travis Brady <[EMAIL PROTECTED]> wrote:

Hi Olga,

Thanks for your help and thank you for open-sourcing Pig.
I converted the FOREACH statement, but it yields a very lengthy error, the
relevant portion of which I've pasted below.

The good news is the modified FOREACH was going to invoke the combiner.

2008-03-26 16:16:00,322 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Error message from task (map) tip_200803261041_0008_m_000000
2008-03-26 16:16:00,322 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Error message from task (map) tip_200803261041_0008_m_000002
java.lang.OutOfMemoryError: Java heap space
   at java.util.Arrays.copyOf(Arrays.java:2786)
   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:71)
   at java.io.DataOutputStream.write(DataOutputStream.java:71)
   at org.apache.pig.data.Tuple.encodeInt(Tuple.java:362)
   at org.apache.pig.data.DataAtom.write(DataAtom.java:137)
   at org.apache.pig.data.Tuple.write(Tuple.java:301)
   at org.apache.pig.data.IndexedTuple.write(IndexedTuple.java:52)
   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:392)
   at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce$MapDataOutputCollector.add(PigMapReduce.java:304)
   at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
   at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.add(GenerateSpec.java:230)
   at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
   at org.apache.pig.impl.eval.collector.DataCollector.addToSuccessor(DataCollector.java:93)
   at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:35)
   at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.exec(GenerateSpec.java:261)
   at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:86)
   at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
   at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:113)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084)


Travis


On Wed, Mar 26, 2008 at 3:51 PM, Olga Natkovich <[EMAIL PROTECTED] inc.com>
wrote:

Hi Travis,

There are a couple of things you can do to improve performance of your
script.

(1) At this point we have pretty basic logic for deciding when a combiner
is invoked. The way your query is written now, it would not be; however,
if you modify your FOREACH statement as follows, it will be:

RollUp = FOREACH HourGroups GENERATE FLATTEN(group), COUNT(Raw);

You can see if the combiner is invoked by running

EXPLAIN RollUp;

(2) You do need to use the PARALLEL keyword on the GROUP operator to make
sure it runs in parallel.
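
On point (1), the reason a combiner helps with COUNT can be sketched outside of Pig. The following is a minimal illustration, not Pig's or Hadoop's implementation: the class name, method names, and sample keys are hypothetical. Each map-side split collapses its rows into one partial count per key, so only those partial counts (rather than every row) have to be sent to the reducer, which then adds them up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of combiner-style partial aggregation for COUNT.
public class CombinerSketch {

    // Map side with a combiner: collapse one split's keys into partial counts.
    static Map<String, Long> partialCounts(List<String> keysInSplit) {
        Map<String, Long> counts = new HashMap<String, Long>();
        for (String key : keysInSplit) {
            Long seen = counts.get(key);
            counts.put(key, seen == null ? 1L : seen + 1L);
        }
        return counts;
    }

    // Reduce side: add up the partial counts received from every split.
    static Map<String, Long> mergeCounts(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new HashMap<String, Long>();
        for (Map<String, Long> partial : partials) {
            for (Map.Entry<String, Long> e : partial.entrySet()) {
                Long soFar = totals.get(e.getKey());
                totals.put(e.getKey(), soFar == null ? e.getValue() : soFar + e.getValue());
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up group keys; the real data layout is not shown in this thread.
        List<String> split1 = Arrays.asList("2007-07-01 00", "2007-07-01 00", "2007-07-01 01");
        List<String> split2 = Arrays.asList("2007-07-01 00", "2007-07-01 01");
        List<Map<String, Long>> partials =
                Arrays.asList(partialCounts(split1), partialCounts(split2));
        System.out.println(mergeCounts(partials));
    }
}
```

Here five input rows shrink to four partial counts before the merge; with millions of rows per key the reduction is far larger.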

Finally, we are working on some performance improvements as part of
the pipeline redesign. You can track the progress at
https://issues.apache.org/jira/browse/pig-157.

Olga

-----Original Message-----
From: Travis Brady [EMAIL PROTECTED]
Sent: Wednesday, March 26, 2008 2:03 PM
To: pig-user@incubator.apache.org
Subject: Pig performance

I really like writing Pig code, but I'm experiencing pretty
terrible performance using Pig for a simple data rollup,
which takes about 90 minutes to complete.  The equivalent,
expressed using shell scripts and Haskell and executed with
Hadoop streaming, runs in roughly 5 minutes.
My dataset is stored on HDFS as a handful of tab-delimited
text files.  In sum there are 19 million rows of data.

This is running on a 3-node cluster where each machine has
8 GB of RAM.  I have all three machines configured per the
instructions on the Hadoop wiki on setting up Hadoop on Ubuntu.

Here is the pig code:
<code>
Raw = LOAD 'stats_dump_200707' USING PigStorage('\t');

HourGroups = GROUP Raw by $0;

RollUp = FOREACH HourGroups {
    GENERATE FLATTEN(group), COUNT(Raw); }

DUMP RollUp;
</code>

Do I need to add the PARALLEL keyword in there somewhere?
Change something in hadoop-site.xml?

The Hadoop streaming stuff uses "cut -c 1-13" as the mapper
and a bit of Haskell compiled with ghc as the reducer.
I can send the Haskell code along if it would help, but for
now I assume I must be doing something wrong for it to
perform so poorly.
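
For readers who want the shape of that streaming job, here is a rough Java sketch of it. It is not the Haskell reducer, which is not shown in this thread. The mapper keeps the first 13 characters of each line, as "cut -c 1-13" does. The reducer is assumed to count consecutive equal keys in sorted input. The class name, method names, and sample lines are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the streaming rollup: cut-style mapper plus a counting reducer.
public class StreamingRollupSketch {

    // Mapper: same effect as "cut -c 1-13", i.e. keep the first 13 characters.
    static String mapLine(String line) {
        return line.substring(0, Math.min(13, line.length()));
    }

    // Reducer (assumed behavior): keys arrive sorted, so count each run of equal keys.
    static List<String> reduceSorted(List<String> sortedKeys) {
        List<String> out = new ArrayList<String>();
        String current = null;
        long count = 0;
        for (String key : sortedKeys) {
            if (!key.equals(current)) {
                if (current != null) {
                    out.add(current + "\t" + count);
                }
                current = key;
                count = 0;
            }
            count++;
        }
        if (current != null) {
            out.add(current + "\t" + count);
        }
        return out;
    }

    static List<String> rollUp(List<String> lines) {
        List<String> keys = new ArrayList<String>();
        for (String line : lines) {
            keys.add(mapLine(line));
        }
        Collections.sort(keys); // stands in for the sort between map and reduce
        return reduceSorted(keys);
    }

    public static void main(String[] args) {
        // Made-up rows; the real column layout is not shown in this thread.
        List<String> lines = Arrays.asList(
                "2007-07-01 00\tfoo", "2007-07-01 01\tbar", "2007-07-01 00\tbaz");
        for (String row : rollUp(lines)) {
            System.out.println(row);
        }
    }
}
```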

thank you

--
Travis Brady
www.mochiads.com




