Try extending the Java heap size as well. I'd be interested to see what kind of effect that has on the time (if any).
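In case it's useful, here's roughly what I mean (a sketch, not tested on your cluster; the jar path, mapper name, and the 1000m value are just placeholders for your setup). You can either raise mapred.child.java.opts in hadoop-site.xml:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1000m</value>
  </property>

or, if I remember the streaming options correctly, override it per job:

  hadoop jar contrib/hadoop-streaming.jar \
      -jobconf mapred.child.java.opts=-Xmx1000m \
      -mapper ./program2 -input in -output out

That's the same property you're already using to pass the hprof flag, so it should take effect in the child JVMs that run your mapper.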
On Mon, Mar 31, 2008 at 2:30 PM, lin <[EMAIL PROTECTED]> wrote:
> I'm running custom map programs written in C++. What the programs do is very
> simple. For example, in program 2, for each input line
>
>   ID node1 node2 ... nodeN
>
> the program outputs
>
>   node1 ID
>   node2 ID
>   ...
>   nodeN ID
>
> Each node has 4 GB to 8 GB of memory. The Java memory setting is -Xmx300m.
>
> I agree that it depends on the scripts. I tried replicating the computation
> for each input line 10 times and saw significantly better speedup. But it is
> still pretty bad that Hadoop streaming has such a big overhead for simple
> programs.
>
> I also tried writing program 1 with the Hadoop Java API. I got almost a 1000%
> speedup on the cluster.
>
> Lin
>
> On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <[EMAIL PROTECTED]> wrote:
> > Are you running a custom map script or a standard Linux command like wc?
> > If custom, what does your script do?
> >
> > How much RAM do you have? What are your Java memory settings?
> >
> > I used the following setup:
> >
> > 2 dual cores, 16 GB RAM, 1000 MB Java heap size on an empty box with a
> > 4-task max.
> >
> > I got the following results:
> >
> >   wc: 30-40% speedup
> >   sort: 40% speedup
> >   grep: 5x slowdown (turns out this was due to what you described above...
> >     grep is just very highly optimized for the command line)
> >   Custom Perl script (essentially a for loop that matches each row of a
> >     dataset against a set of 100 categories): 60% speedup
> >
> > So I do think that it depends on your script... and some other settings of
> > yours.
> >
> > Theo
> >
> > On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > I am looking into using Hadoop streaming to parallelize some simple
> > > programs. So far the performance has been pretty disappointing.
> > >
> > > The cluster contains 5 nodes. Each node has two CPU cores. The task
> > > capacity of each node is 2. The Hadoop version is 0.15.
> > >
> > > Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes in
> > > standalone (on a single CPU core). Program 2 runs for 5 minutes on the
> > > Hadoop cluster and 4.5 minutes in standalone. Both programs run as
> > > map-only jobs.
> > >
> > > I understand that there is some overhead in starting up tasks and in
> > > reading from and writing to the distributed file system. But that does
> > > not seem to explain all the overhead. Most map tasks are data-local. I
> > > modified program 1 to output nothing and saw the same magnitude of
> > > overhead.
> > >
> > > The output of top shows that the majority of the CPU time is consumed by
> > > Hadoop Java processes (e.g. org.apache.hadoop.mapred.TaskTracker$Child).
> > > So I added a profiling option (-agentlib:hprof=cpu=samples) to
> > > mapred.child.java.opts.
> > >
> > > The profile results show that most of the CPU time is spent in the
> > > following methods:
> > >
> > >   rank  self    accum   count  trace   method
> > >   1     23.76%  23.76%  1246   300472  java.lang.UNIXProcess.waitForProcessExit
> > >   2     23.74%  47.50%  1245   300474  java.io.FileInputStream.readBytes
> > >   3     23.67%  71.17%  1241   300479  java.io.FileInputStream.readBytes
> > >   4     16.15%  87.32%  847    300478  java.io.FileOutputStream.writeBytes
> > >
> > > And their stack traces show that these methods are for interacting with
> > > the map program.
> > >
> > > TRACE 300472:
> > >   java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
> > >   java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > >   java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > >
> > > TRACE 300474:
> > >   java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
> > >   java.io.FileInputStream.read(FileInputStream.java:199)
> > >   java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> > >   java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> > >   java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > >   java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> > >   java.io.FilterInputStream.read(FilterInputStream.java:66)
> > >   org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
> > >   org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
> > >   org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:348)
> > >
> > > TRACE 300479:
> > >   java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
> > >   java.io.FileInputStream.read(FileInputStream.java:199)
> > >   java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > >   java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> > >   java.io.FilterInputStream.read(FilterInputStream.java:66)
> > >   org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
> > >   org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
> > >   org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(PipeMapRed.java:399)
> > >
> > > TRACE 300478:
> > >   java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknown line)
> > >   java.io.FileOutputStream.write(FileOutputStream.java:260)
> > >   java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> > >   java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> > >   java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
> > >   java.io.DataOutputStream.flush(DataOutputStream.java:106)
> > >   org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
> > >   org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > >   org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> > >   org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
> > >
> > > I don't understand why Hadoop streaming needs so much CPU time to read
> > > from and write to the map program. Note that it spends 23.67% of the time
> > > reading from the standard error of the map program, while the program
> > > does not output any errors at all!
> > >
> > > Does anyone know any way to get rid of this seemingly unnecessary
> > > overhead in Hadoop streaming?
> > >
> > > Thanks,
> > >
> > > Lin
> > >
> >
> > --
> > Theodore Van Rooy
> > http://greentheo.scroggles.com
>

--
Theodore Van Rooy
http://greentheo.scroggles.com
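A minimal sketch of what a "program 2"-style streaming mapper might look like, reconstructed from the description earlier in the thread (this is not Lin's actual C++ code, and the tab as the key/value separator is an assumption based on streaming's default):

    // Sketch only: reads lines of the form "ID node1 node2 ... nodeN" from
    // stdin and emits one "nodeX<TAB>ID" line per node, per the thread above.
    #include <iostream>
    #include <sstream>
    #include <string>

    int main() {
        std::string line;
        while (std::getline(std::cin, line)) {
            std::istringstream in(line);
            std::string id, node;
            if (!(in >> id)) continue;   // skip empty lines
            while (in >> node) {
                std::cout << node << '\t' << id << '\n';
            }
        }
        return 0;
    }

The per-line work in a mapper like this is trivial, which is consistent with Lin's observation that the stream handling inside PipeMapRed, rather than the map program itself, dominates the profile.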