Sorry, I just noticed that I mistyped... Meant to say:
> direct reduce->map links.

Currently there is no sanctioned method of 'piping' the reduce output of one 
job directly into the map input of another (although it has been discussed: see 
the thread I linked before: http://www.nabble.com/Poly-reduce--tf4313116.html ).
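To make the distinction concrete: today the sanctioned pattern is to chain jobs through a materialized intermediate dataset, i.e. job A's reduce output is written out (to HDFS in real Hadoop) and then read back as job B's map input. Here is a toy in-memory sketch of that pattern; the `run_job` driver and the mapper/reducer functions are illustrative stand-ins, not Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def run_job(records, mapper, reducer):
    """Toy MapReduce: map every record, sort/shuffle by key, reduce."""
    mapped = [kv for rec in records for kv in mapper(rec)]
    mapped.sort(key=itemgetter(0))          # the shuffle/sort phase
    out = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        out.extend(reducer(key, [v for _, v in group]))
    return out

# Job A: word count.
def map_a(line):
    return [(w, 1) for w in line.split()]

def reduce_a(word, counts):
    return [(word, sum(counts))]

# Job B: group words by how often they occurred.
def map_b(kv):
    word, n = kv
    return [(n, word)]

def reduce_b(n, words):
    return [(n, sorted(words))]

# Lacking a direct reduce->map pipe, job A's full output is
# materialized before job B can start reading it.
intermediate = run_job(["a b a", "b a"], map_a, reduce_a)
result = run_job(intermediate, map_b, reduce_b)
```

A direct reduce->map link would stream `intermediate` into job B's mappers as reducers emit it, skipping the materialization step (and with it, the ability to restart job B independently if it fails).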

The main focus of Hadoop is large clusters and long jobs, but an _optional_ 
method of job composition that increases speed at the cost of a higher chance 
of failure would definitely be worthwhile for small-to-medium clusters and 
short jobs.

Perhaps Vuk Ercegovac could be convinced to submit the patch he mentioned in 
the thread I linked.

Thanks,
Stu



-----Original Message-----
From: Doug Cutting <[EMAIL PROTECTED]>
Sent: Friday, November 9, 2007 11:20am
To: [email protected]
Subject: Re: Tech Talk: Dryad

Stu Hood wrote:
> The slide comparing the time taken to spill to disk between vertices vs 
> operating purely in memory (around minute 26) is definitely something to 
> think about.

I have not had a chance to watch the video yet, but, in MapReduce, if 
the intermediate dataset is larger than the RAM on your cluster, then 
you must spill to disk in order to sort.  (When it is smaller, then we 
should of course avoid disk, but that's not the typical case.)  If you 
don't sort, then it's just map, and piping a sequence of maps together 
is trivial to do on the same host, no need to even move the data over 
the wire.  So I don't yet see the direct relevance.  What am I missing? 
(Maybe I should watch the video...)
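Doug's point that sort-free map stages compose trivially on one host can be sketched as plain function composition: each record flows through the mappers in sequence, with no shuffle and nothing crossing the wire. A minimal illustration (the helper name `compose_maps` is made up for this sketch):

```python
def compose_maps(*mappers):
    """Chain map stages with no intermediate shuffle: every record
    passes through each mapper in order, on the same host."""
    def chained(record):
        records = [record]
        for m in mappers:
            # Each mapper may emit zero or more records per input.
            records = [out for r in records for out in m(r)]
        return records
    return chained

tokenize = lambda line: line.split()        # one line -> many words
upcase = lambda word: [word.upper()]        # one word -> one word
pipeline = compose_maps(tokenize, upcase)
words = pipeline("hello hadoop")
```

Only once a sort is needed (i.e. an actual reduce) does the intermediate data have to be gathered and, if it exceeds RAM, spilled to disk.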

Doug

