Sorry, I just noticed that I mistyped... Meant to say:

> direct reduce->map links.

Currently there is no sanctioned method of 'piping' the reduce output of one job directly into the map input of another (although it has been discussed: see the thread I linked before: http://www.nabble.com/Poly-reduce--tf4313116.html ). The main focus of Hadoop is large clusters and long jobs, but an _optional_ method of job composition that increases speed (at the cost of a higher chance of failure, since there is no intermediate checkpoint to restart from) would definitely be worthwhile for small-to-medium clusters and short jobs. Perhaps Vuk Ercegovac could be convinced to submit the patch he mentioned in the thread I linked.

Thanks,
Stu

-----Original Message-----
From: Doug Cutting <[EMAIL PROTECTED]>
Sent: Friday, November 9, 2007 11:20am
To: [email protected]
Subject: Re: Tech Talk: Dryad

Stu Hood wrote:
> The slide comparing the time taken to spill to disk between vertices vs
> operating purely in memory (around minute 26) is definitely something to
> think about.

I have not had a chance to watch the video yet, but, in MapReduce, if the intermediate dataset is larger than the RAM on your cluster, then you must spill to disk in order to sort. (When it is smaller, we should of course avoid disk, but that's not the typical case.)

If you don't sort, then it's just map, and piping a sequence of maps together is trivial to do on the same host; there is no need to even move the data over the wire. So I don't yet see the direct relevance. What am I missing? (Maybe I should watch the video...)

Doug
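Doug's point that chaining a sequence of maps is trivial on a single host can be illustrated outside Hadoop: since each map stage is a per-record function, fusing two map stages is ordinary function composition, with no sort, shuffle, or network transfer involved. A minimal sketch in plain Java (the `MapChain` class and its method names are hypothetical illustrations, not Hadoop API):

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: fusing two map-only stages is just function
// composition applied record by record. No sort, spill, or network I/O
// is needed, which is why map-only pipelines compose so cheaply.
public class MapChain {

    // Compose two per-record map functions into a single map function.
    static <A, B, C> Function<A, C> chain(Function<A, B> first,
                                          Function<B, C> second) {
        return first.andThen(second);
    }

    // Apply a (possibly composed) map function to every record locally.
    static <A, B> List<B> runMap(List<A> records, Function<A, B> map) {
        return records.stream().map(map).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Function<String, String> lower = String::toLowerCase;
        Function<String, Integer> length = String::length;
        // Two "jobs" fused into one local pass over the records.
        List<Integer> out = runMap(List.of("Hadoop", "Dryad"),
                                   chain(lower, length));
        System.out.println(out); // [6, 5]
    }
}
```

The reduce->map piping Stu asks about is the harder case precisely because reduce requires a sort (and hence a possible spill to disk) before its output exists, so it cannot be fused this way.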
