Andrew,

I would also suggest running the TestDFSIO benchmark to isolate I/O-related issues:

hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
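
The write pass has to run before the read pass, since the read test reads the files the write test created. When you're done you can remove the generated test data with -clean (that option is in the stock 0.20.2 test jar as far as I remember; double-check against the usage message TestDFSIO prints if you run it with no flags):

hadoop jar hadoop-0.20.2-test.jar TestDFSIO -clean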

There are additional tests specific to MapReduce; run "hadoop jar hadoop-0.20.2-test.jar" with no arguments for the complete list.
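
mrbench is a handy one for separating framework overhead from your own code: it runs a tiny job many times, so you can see how much time goes into just launching and tearing down tasks. The flags below are from memory, so double-check them against the usage output:

hadoop jar hadoop-0.20.2-test.jar mrbench -numRuns 10 -maps 8 -reduces 4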

45 minutes to map 6 GB on 5 nodes is far too long, assuming your gain/offset
conversion is a simple algebraic manipulation.

It takes less than 5 minutes to run a simple mapper (using streaming) on a
4-node cluster over something like 10 GB. The mapper I used was an awk command
extracting <key:value> pairs from a log (no reducer).
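
For comparison with your conversion job, a map-only streaming job along those lines looks roughly like this. The gain/offset values, paths, and script name are only placeholders, and -numReduceTasks 0 gives the map-only behavior Todd suggests below:

convert.sh (reads tab-separated <sample number, sample value> lines on stdin):

#!/bin/sh
# apply value * gain + offset to field 2, keep field 1 as the key
exec awk -F'\t' '{ printf "%s\t%f\n", $1, $2 * 2.5 + (-1.0) }'

hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
  -input /user/andrew/samples \
  -output /user/andrew/samples-converted \
  -mapper convert.sh \
  -file convert.sh \
  -numReduceTasks 0

The -file option ships convert.sh to every task, so the script only needs to exist on the machine you submit from.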

Thanks
Alex




On Mon, Apr 12, 2010 at 1:53 PM, Todd Lipcon <t...@cloudera.com> wrote:

> Hi Andrew,
>
> Do you need the sorting behavior that having an identity reducer gives you?
> If not, set the number of reduce tasks to 0 and you'll end up with a map
> only job, which should be significantly faster.
>
> -Todd
>
> On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen <andrew-lists-had...@ucsfcti.org> wrote:
>
> > Hello,
> >
> > I recently set up a 5-node cluster (1 master, 4 slaves) and am looking to
> > use it to process high volumes of patient physiologic data.  As an initial
> > exercise to gain a better understanding, I have attempted to run the
> > following problem (which isn't the type of problem that Hadoop was really
> > designed for, as is my understanding).
> >
> > I have a 6G data file that contains key/value pairs of <sample number,
> > sample value>.  I'd like to convert the values based on a gain/offset to
> > their physical units.  I've set up a MapReduce job using streaming where
> > the mapper does the conversion, and the reducer is just an identity
> > reducer.  Based on other threads on the mailing list, my initial results
> > are consistent in the fact that it takes considerably more time to process
> > this in Hadoop than it does on my MacBook Pro (45 minutes vs. 13 minutes).
> > The input is a single 6G file and it looks like the file is being split
> > into 101 map tasks.  This is consistent with the 64M block size.
> >
> > So my questions are:
> >
> > * Would it help to increase the block size to 128M?  Or, decrease the
> > block size?  What are some key factors to think about with this question?
> > * Are there any other optimizations that I could employ?  I have looked
> > into LzoCompression but I'd like to still work without compression since
> > the single-threaded job that I'm comparing to doesn't use any sort of
> > compression.  I know I'm comparing apples to pears a little here so please
> > feel free to correct this assumption.
> > * Is Hadoop really only good for jobs where the data doesn't fit on a
> > single node?  At some level, I assume that it can still speed up jobs that
> > do fit on one node, if only because you are performing tasks in parallel.
> >
> > Thanks!
> >
> > --Andrew
>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
