@Todd: I do need the sorting behavior eventually. However, I'll try it with zero reduce tasks and see how it does.
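For reference, here's a rough sketch of what my gain/offset mapper boils down to (the GAIN/OFFSET constants are placeholders, not my real calibration values, and I'm assuming tab-separated <sample number, sample value> input lines):

    #!/usr/bin/env python
    # Rough sketch of the gain/offset conversion mapper (placeholder constants).
    # Reads "<sample number>\t<sample value>" lines from stdin and emits the
    # converted value with the same key.
    import sys

    GAIN = 0.5     # placeholder, not our real calibration gain
    OFFSET = -2.0  # placeholder, not our real calibration offset

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed lines
        sample_no, raw_value = fields
        physical = float(raw_value) * GAIN + OFFSET
        # emit the new key/value pair; with zero reduce tasks this output is
        # written out directly, with no sort/shuffle phase
        print("%s\t%s" % (sample_no, physical))

To make it map-only I'd run it through streaming with something like -D mapred.reduce.tasks=0 -mapper convert.py -file convert.py (convert.py is just a placeholder name for the script).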
@Alex: Yes, I was planning on incrementally building my mapper and reducer functions. Currently, the mapper takes the value, multiplies it by the gain, adds the offset, and outputs a new key/value pair. I started running the tests but didn't know how long they should take with the parameters you listed below; it seemed like no progress was being made. I ran them with increasing parameter values, and the results are included below.

Here is a run with nrFiles 1 and fileSize 10:

had...@cluster-1:/usr/lib/hadoop$ hadoop jar hadoop-0.20.2+228-test.jar TestDFSIO -write -nrFiles 1 -fileSize 10
TestFDSIO.0.0.4
10/04/12 11:57:18 INFO mapred.FileInputFormat: nrFiles = 1
10/04/12 11:57:18 INFO mapred.FileInputFormat: fileSize (MB) = 10
10/04/12 11:57:18 INFO mapred.FileInputFormat: bufferSize = 1000000
10/04/12 11:57:18 INFO mapred.FileInputFormat: creating control file: 10 mega bytes, 1 files
10/04/12 11:57:19 INFO mapred.FileInputFormat: created control files for: 1 files
10/04/12 11:57:19 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/04/12 11:57:19 INFO mapred.FileInputFormat: Total input paths to process : 1
10/04/12 11:57:19 INFO mapred.JobClient: Running job: job_201004111107_0017
10/04/12 11:57:20 INFO mapred.JobClient:  map 0% reduce 0%
10/04/12 11:57:27 INFO mapred.JobClient:  map 100% reduce 0%
10/04/12 11:57:39 INFO mapred.JobClient:  map 100% reduce 100%
10/04/12 11:57:41 INFO mapred.JobClient: Job complete: job_201004111107_0017
10/04/12 11:57:41 INFO mapred.JobClient: Counters: 18
10/04/12 11:57:41 INFO mapred.JobClient:   Job Counters
10/04/12 11:57:41 INFO mapred.JobClient:     Launched reduce tasks=1
10/04/12 11:57:41 INFO mapred.JobClient:     Launched map tasks=1
10/04/12 11:57:41 INFO mapred.JobClient:     Data-local map tasks=1
10/04/12 11:57:41 INFO mapred.JobClient:   FileSystemCounters
10/04/12 11:57:41 INFO mapred.JobClient:     FILE_BYTES_READ=98
10/04/12 11:57:41 INFO mapred.JobClient:     HDFS_BYTES_READ=113
10/04/12 11:57:41 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=228
10/04/12 11:57:41 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=10485832
10/04/12 11:57:41 INFO mapred.JobClient:   Map-Reduce Framework
10/04/12 11:57:41 INFO mapred.JobClient:     Reduce input groups=5
10/04/12 11:57:41 INFO mapred.JobClient:     Combine output records=0
10/04/12 11:57:41 INFO mapred.JobClient:     Map input records=1
10/04/12 11:57:41 INFO mapred.JobClient:     Reduce shuffle bytes=0
10/04/12 11:57:41 INFO mapred.JobClient:     Reduce output records=5
10/04/12 11:57:41 INFO mapred.JobClient:     Spilled Records=10
10/04/12 11:57:41 INFO mapred.JobClient:     Map output bytes=82
10/04/12 11:57:41 INFO mapred.JobClient:     Map input bytes=27
10/04/12 11:57:41 INFO mapred.JobClient:     Combine input records=0
10/04/12 11:57:41 INFO mapred.JobClient:     Map output records=5
10/04/12 11:57:41 INFO mapred.JobClient:     Reduce input records=5
10/04/12 11:57:41 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
10/04/12 11:57:41 INFO mapred.FileInputFormat:            Date & time: Mon Apr 12 11:57:41 PST 2010
10/04/12 11:57:41 INFO mapred.FileInputFormat:        Number of files: 1
10/04/12 11:57:41 INFO mapred.FileInputFormat: Total MBytes processed: 10
10/04/12 11:57:41 INFO mapred.FileInputFormat:      Throughput mb/sec: 8.710801393728223
10/04/12 11:57:41 INFO mapred.FileInputFormat: Average IO rate mb/sec: 8.710801124572754
10/04/12 11:57:41 INFO mapred.FileInputFormat:  IO rate std deviation: 0.0017763302275007867
10/04/12 11:57:41 INFO mapred.FileInputFormat:     Test exec time sec: 22.757
10/04/12 11:57:41 INFO mapred.FileInputFormat:

Here is a run with nrFiles 10 and fileSize 100:

had...@cluster-1:/usr/lib/hadoop$ hadoop jar hadoop-0.20.2+228-test.jar TestDFSIO -write -nrFiles 10 -fileSize 100
TestFDSIO.0.0.4
10/04/12 11:58:54 INFO mapred.FileInputFormat: nrFiles = 10
10/04/12 11:58:54 INFO mapred.FileInputFormat: fileSize (MB) = 100
10/04/12 11:58:54 INFO mapred.FileInputFormat: bufferSize = 1000000
10/04/12 11:58:54 INFO mapred.FileInputFormat: creating control file: 100 mega bytes, 10 files
10/04/12 11:58:55 INFO mapred.FileInputFormat: created control files for: 10 files
10/04/12 11:58:55 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/04/12 11:58:55 INFO mapred.FileInputFormat: Total input paths to process : 10
10/04/12 11:58:55 INFO mapred.JobClient: Running job: job_201004111107_0018
10/04/12 11:58:56 INFO mapred.JobClient:  map 0% reduce 0%
10/04/12 11:59:45 INFO mapred.JobClient:  map 10% reduce 0%
10/04/12 11:59:54 INFO mapred.JobClient:  map 10% reduce 3%
10/04/12 11:59:59 INFO mapred.JobClient:  map 20% reduce 3%
10/04/12 12:00:01 INFO mapred.JobClient:  map 40% reduce 3%
10/04/12 12:00:03 INFO mapred.JobClient:  map 50% reduce 3%
10/04/12 12:00:08 INFO mapred.JobClient:  map 60% reduce 3%
10/04/12 12:00:09 INFO mapred.JobClient:  map 60% reduce 16%
10/04/12 12:00:11 INFO mapred.JobClient:  map 70% reduce 16%
10/04/12 12:00:18 INFO mapred.JobClient:  map 70% reduce 20%
10/04/12 12:00:23 INFO mapred.JobClient:  map 80% reduce 20%
10/04/12 12:00:24 INFO mapred.JobClient:  map 80% reduce 23%
10/04/12 12:00:26 INFO mapred.JobClient:  map 90% reduce 23%
10/04/12 12:00:30 INFO mapred.JobClient:  map 100% reduce 23%
10/04/12 12:00:33 INFO mapred.JobClient:  map 100% reduce 26%
10/04/12 12:00:39 INFO mapred.JobClient:  map 100% reduce 100%
10/04/12 12:00:41 INFO mapred.JobClient: Job complete: job_201004111107_0018
10/04/12 12:00:41 INFO mapred.JobClient: Counters: 18
10/04/12 12:00:41 INFO mapred.JobClient:   Job Counters
10/04/12 12:00:41 INFO mapred.JobClient:     Launched reduce tasks=1
10/04/12 12:00:41 INFO mapred.JobClient:     Launched map tasks=14
10/04/12 12:00:41 INFO mapred.JobClient:     Data-local map tasks=14
10/04/12 12:00:41 INFO mapred.JobClient:   FileSystemCounters
10/04/12 12:00:41 INFO mapred.JobClient:     FILE_BYTES_READ=961
10/04/12 12:00:41 INFO mapred.JobClient:     HDFS_BYTES_READ=1130
10/04/12 12:00:41 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=2296
10/04/12 12:00:41 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1048576079
10/04/12 12:00:41 INFO mapred.JobClient:   Map-Reduce Framework
10/04/12 12:00:41 INFO mapred.JobClient:     Reduce input groups=5
10/04/12 12:00:41 INFO mapred.JobClient:     Combine output records=0
10/04/12 12:00:41 INFO mapred.JobClient:     Map input records=10
10/04/12 12:00:41 INFO mapred.JobClient:     Reduce shuffle bytes=914
10/04/12 12:00:41 INFO mapred.JobClient:     Reduce output records=5
10/04/12 12:00:41 INFO mapred.JobClient:     Spilled Records=100
10/04/12 12:00:41 INFO mapred.JobClient:     Map output bytes=855
10/04/12 12:00:41 INFO mapred.JobClient:     Map input bytes=270
10/04/12 12:00:41 INFO mapred.JobClient:     Combine input records=0
10/04/12 12:00:41 INFO mapred.JobClient:     Map output records=50
10/04/12 12:00:41 INFO mapred.JobClient:     Reduce input records=50
10/04/12 12:00:41 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
10/04/12 12:00:41 INFO mapred.FileInputFormat:            Date & time: Mon Apr 12 12:00:41 PST 2010
10/04/12 12:00:41 INFO mapred.FileInputFormat:        Number of files: 10
10/04/12 12:00:41 INFO mapred.FileInputFormat: Total MBytes processed: 1000
10/04/12 12:00:41 INFO mapred.FileInputFormat:      Throughput mb/sec: 1.9073850132944736
10/04/12 12:00:41 INFO mapred.FileInputFormat: Average IO rate mb/sec: 2.1501593589782715
10/04/12 12:00:41 INFO mapred.FileInputFormat:  IO rate std deviation: 0.8994861001170683
10/04/12 12:00:41 INFO mapred.FileInputFormat:     Test exec time sec: 106.45
10/04/12 12:00:41 INFO mapred.FileInputFormat:

The throughput is a lot lower for the 10-file/100 MB run than for the 1-file/10 MB run...

Here are some rough specs of our cluster: 5 identically spec'ed nodes, each with:

  2 GB RAM
  Pentium 4 3.0 GHz with HT
  250 GB HDD on PATA
  10 Mbps NIC

They are on a private network on a Dell switch.

Thanks!

--Andrew

On Apr 12, 2010, at 11:58 AM, alex kamil wrote:

> Andrew,
>
> I would also suggest running the DFSIO benchmark to isolate IO-related issues:
>
> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
>
> There are additional tests specific to MapReduce - run "hadoop jar
> hadoop-0.20.2-test.jar" for the complete list.
>
> 45 min for mapping 6 GB on 5 nodes is way too high, assuming your gain/offset
> conversion is a simple algebraic manipulation.
>
> It takes less than 5 min to run a simple mapper (using streaming) on a 4-node
> cluster on something like 10 GB; the mapper I used was an awk command
> extracting <key:value> pairs from a log (no reducer).
>
> Thanks
> Alex
>
>
> On Mon, Apr 12, 2010 at 1:53 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
> Hi Andrew,
>
> Do you need the sorting behavior that having an identity reducer gives you?
> If not, set the number of reduce tasks to 0 and you'll end up with a map-only
> job, which should be significantly faster.
>
> -Todd
>
> On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen <
> andrew-lists-had...@ucsfcti.org> wrote:
>
> > Hello,
> >
> > I recently set up a 5-node cluster (1 master, 4 slaves) and am looking to
> > use it to process high volumes of patient physiologic data. As an initial
> > exercise to gain a better understanding, I have attempted to run the
> > following problem (which isn't the type of problem that Hadoop was really
> > designed for, as I understand it).
> >
> > I have a 6 GB data file that contains key/value pairs of <sample number,
> > sample value>. I'd like to convert the values, based on a gain/offset, to
> > their physical units. I've set up a MapReduce job using streaming where
> > the mapper does the conversion and the reducer is just an identity
> > reducer. Consistent with other threads on the mailing list, my initial
> > results show that it takes considerably more time to process this in
> > Hadoop than it does on my MacBook Pro (45 minutes vs. 13 minutes). The
> > input is a single 6 GB file, and it looks like the file is being split
> > into 101 map tasks. This is consistent with the 64 MB block size.
> >
> > So my questions are:
> >
> > * Would it help to increase the block size to 128 MB? Or decrease the
> > block size? What are some key factors to think about with this question?
> > * Are there any other optimizations that I could employ? I have looked
> > into LzoCompression, but I'd like to still work without compression since
> > the single-threaded job that I'm comparing against doesn't use any sort of
> > compression. I know I'm comparing apples to pears a little here, so please
> > feel free to correct this assumption.
> > * Is Hadoop really only good for jobs where the data doesn't fit on a
> > single node?
> > At some level, I assume that it can still speed up jobs that do fit on
> > one node, if only because you are performing tasks in parallel.
> >
> > Thanks!
> >
> > --Andrew
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera