Try extending the Java heap size as well. I'd be interested to see what kind of effect that has on the time (if any).
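In case it's useful, here's roughly what I mean (a sketch, not tested on your cluster; the jar path, mapper name, and the 1000m value are just placeholders for your setup). You can either raise mapred.child.java.opts in hadoop-site.xml:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1000m</value>
  </property>

or, if I remember the streaming options correctly, override it per job:

  hadoop jar contrib/hadoop-streaming.jar \
      -jobconf mapred.child.java.opts=-Xmx1000m \
      -mapper ./program2 -input in -output out

That's the same property you're already using to pass the hprof flag, so it should take effect in the child JVMs that run your mapper.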
On Mon, Mar 31, 2008 at 2:30 PM, lin <[EMAIL PROTECTED]> wrote:
> I'm running custom map programs written in C++. What the programs do is very
> simple. For example, in program 2, for each input line
>
>   ID node1 node2 ... nodeN
>
> the program outputs
>
>   node1 ID
>   node2 ID
>   ...
>   nodeN ID
>
> Each node has 4 GB to 8 GB of memory. The Java memory setting is -Xmx300m.
>
> I agree that it depends on the scripts. I tried replicating the computation
> for each input line 10 times and saw significantly better speedup. But it is
> still pretty bad that Hadoop streaming has such a big overhead for simple
> programs.
>
> I also tried writing program 1 with the Hadoop Java API. I got almost a 1000%
> speedup on the cluster.
>
> Lin
>
> On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <[EMAIL PROTECTED]> wrote:
> > Are you running a custom map script or a standard Linux command like wc?
> > If custom, what does your script do?
> >
> > How much RAM do you have? What are your Java memory settings?
> >
> > I used the following setup:
> >
> > 2 dual cores, 16 GB RAM, 1000 MB Java heap size on an empty box with a
> > 4-task max.
> >
> > I got the following results:
> >
> >   wc: 30-40% speedup
> >   sort: 40% speedup
> >   grep: 5x slowdown (turns out this was due to what you described above...
> >     grep is just very highly optimized for the command line)
> >   Custom Perl script (essentially a for loop that matches each row of a
> >     dataset against a set of 100 categories): 60% speedup
> >
> > So I do think that it depends on your script... and some other settings of
> > yours.
> >
> > Theo
> >
> > On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > I am looking into using Hadoop streaming to parallelize some simple
> > > programs. So far the performance has been pretty disappointing.
> > >
> > > The cluster contains 5 nodes. Each node has two CPU cores. The task
> > > capacity of each node is 2. The Hadoop version is 0.15.
> > >
> > > Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes in
> > > standalone (on a single CPU core). Program 2 runs for 5 minutes on the
> > > Hadoop cluster and 4.5 minutes in standalone. Both programs run as
> > > map-only jobs.
> > >
> > > I understand that there is some overhead in starting up tasks and in
> > > reading from and writing to the distributed file system. But that does
> > > not seem to explain all the overhead. Most map tasks are data-local. I
> > > modified program 1 to output nothing and saw the same magnitude of
> > > overhead.
> > >
> > > The output of top shows that the majority of the CPU time is consumed by
> > > Hadoop Java processes (e.g. org.apache.hadoop.mapred.TaskTracker$Child).
> > > So I added a profiling option (-agentlib:hprof=cpu=samples) to
> > > mapred.child.java.opts.
> > >
> > > The profile results show that most of the CPU time is spent in the
> > > following methods:
> > >
> > >   rank  self    accum   count  trace   method
> > >   1     23.76%  23.76%  1246   300472  java.lang.UNIXProcess.waitForProcessExit
> > >   2     23.74%  47.50%  1245   300474  java.io.FileInputStream.readBytes
> > >   3     23.67%  71.17%  1241   300479  java.io.FileInputStream.readBytes
> > >   4     16.15%  87.32%  847    300478  java.io.FileOutputStream.writeBytes
> > >
> > > And their stack traces show that these methods are for interacting with
> > > the map program.
> > >
> > > TRACE 300472:
> > >   java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
> > >   java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > >   java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > >
> > > TRACE 300474:
> > >   java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
> > >   java.io.FileInputStream.read(FileInputStream.java:199)
> > >   java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> > >   java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> > >   java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > >   java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> > >   java.io.FilterInputStream.read(FilterInputStream.java:66)
> > >   org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
> > >   org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
> > >   org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:348)
> > >
> > > TRACE 300479:
> > >   java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
> > >   java.io.FileInputStream.read(FileInputStream.java:199)
> > >   java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > >   java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> > >   java.io.FilterInputStream.read(FilterInputStream.java:66)
> > >   org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
> > >   org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
> > >   org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(PipeMapRed.java:399)
> > >
> > > TRACE 300478:
> > >   java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknown line)
> > >   java.io.FileOutputStream.write(FileOutputStream.java:260)
> > >   java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> > >   java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> > >   java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
> > >   java.io.DataOutputStream.flush(DataOutputStream.java:106)
> > >   org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
> > >   org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > >   org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> > >   org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
> > >
> > > I don't understand why Hadoop streaming needs so much CPU time to read
> > > from and write to the map program. Note that it spends 23.67% of the time
> > > reading from the standard error of the map program, while the program
> > > does not output any errors at all!
> > >
> > > Does anyone know any way to get rid of this seemingly unnecessary
> > > overhead in Hadoop streaming?
> > >
> > > Thanks,
> > >
> > > Lin
> > >
> >
> > --
> > Theodore Van Rooy
> > http://greentheo.scroggles.com
>

--
Theodore Van Rooy
http://greentheo.scroggles.com
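A minimal sketch of what a "program 2"-style streaming mapper might look like, reconstructed from the description earlier in the thread (this is not Lin's actual C++ code, and the tab as the key/value separator is an assumption based on streaming's default):

    // Sketch only: reads lines of the form "ID node1 node2 ... nodeN" from
    // stdin and emits one "nodeX<TAB>ID" line per node, per the thread above.
    #include <iostream>
    #include <sstream>
    #include <string>

    int main() {
        std::string line;
        while (std::getline(std::cin, line)) {
            std::istringstream in(line);
            std::string id, node;
            if (!(in >> id)) continue;   // skip empty lines
            while (in >> node) {
                std::cout << node << '\t' << id << '\n';
            }
        }
        return 0;
    }

The per-line work in a mapper like this is trivial, which is consistent with Lin's observation that the stream handling inside PipeMapRed, rather than the map program itself, dominates the profile.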