Andrew, I would also suggest running the DFSIO benchmark to isolate IO-related issues:
hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

There are additional tests specific to MapReduce - run "hadoop jar hadoop-0.20.2-test.jar" for the complete list.

45 minutes for mapping 6GB on 5 nodes is way too high, assuming your gain/offset conversion is a simple algebraic manipulation. It took less than 5 minutes to run a simple streaming mapper on a 4-node cluster over something like 10GB; the mapper I used was an awk command extracting <key:value> pairs from a log (no reducer).

Thanks,
Alex

On Mon, Apr 12, 2010 at 1:53 PM, Todd Lipcon <t...@cloudera.com> wrote:

> Hi Andrew,
>
> Do you need the sorting behavior that having an identity reducer gives you?
> If not, set the number of reduce tasks to 0 and you'll end up with a map-only
> job, which should be significantly faster.
>
> -Todd
>
> On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen <
> andrew-lists-had...@ucsfcti.org> wrote:
>
> > Hello,
> >
> > I recently set up a 5-node cluster (1 master, 4 slaves) and am looking to
> > use it to process high volumes of patient physiologic data. As an initial
> > exercise to gain a better understanding, I have attempted to run the
> > following problem (which isn't the type of problem that Hadoop was really
> > designed for, as is my understanding).
> >
> > I have a 6G data file that contains key/value pairs of <sample number,
> > sample value>. I'd like to convert the values based on a gain/offset to
> > their physical units. I've set up a MapReduce job using streaming where
> > the mapper does the conversion and the reducer is just an identity
> > reducer. Consistent with other threads on the mailing list, my initial
> > results show that it takes considerably more time to process this in
> > Hadoop than it does on my MacBook Pro (45 minutes vs. 13 minutes). The
> > input is a single 6G file and it looks like the file is being split into
> > 101 map tasks, which is consistent with the 64M block size.
> >
> > So my questions are:
> >
> > * Would it help to increase the block size to 128M? Or decrease the block
> > size? What are some key factors to think about with this question?
> > * Are there any other optimizations that I could employ? I have looked
> > into LzoCompression, but I'd like to still work without compression since
> > the single-threaded job that I'm comparing against doesn't use any sort of
> > compression. I know I'm comparing apples to pears a little here, so please
> > feel free to correct this assumption.
> > * Is Hadoop really only good for jobs where the data doesn't fit on a
> > single node? At some level, I assume that it can still speed up jobs that
> > do fit on one node, if only because you are performing tasks in parallel.
> >
> > Thanks!
> >
> > --Andrew
> >
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
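
For what it's worth, here is a minimal sketch of the map-only streaming job Todd describes, against 0.20.2. The script name, input/output paths, and the gain/offset values below are placeholders (not taken from this thread), and the streaming jar path depends on where Hadoop is installed.

    # convert.awk - map each "sample_number<TAB>raw_value" line to
    # "sample_number<TAB>physical_value". Gain 0.25 and offset -512
    # are made-up placeholders; substitute your real calibration.
    BEGIN { FS = OFS = "\t"; gain = 0.25; offset = -512 }
    { print $1, $2 * gain + offset }

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
        -D mapred.reduce.tasks=0 \
        -input /user/andrew/samples \
        -output /user/andrew/samples-converted \
        -mapper 'awk -f convert.awk' \
        -file convert.awk

Note that the -D generic option has to come before the streaming-specific options. With zero reduce tasks, each mapper's output is written straight to HDFS, so the sort and shuffle phases are skipped entirely.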