Stu Hood wrote:
The slide comparing the time taken to spill to disk between vertices vs 
operating purely in memory (around minute 26) is definitely something to think 
about.

I have not had a chance to watch the video yet, but in MapReduce, if the intermediate dataset is larger than the RAM on your cluster, you must spill to disk in order to sort. (When it is smaller, we should of course avoid disk, but that's not the typical case.) If you don't sort, then it's just map, and piping a sequence of maps together is trivial to do on the same host, with no need to even move the data over the wire. So I don't yet see the direct relevance. What am I missing? (Maybe I should watch the video...)
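
To make the distinction concrete, here is a rough Python sketch. It is not Hadoop code: sort_with_spill, pipe_maps and the max_in_memory parameter are hypothetical names chosen for illustration. The sort writes sorted runs to temporary files only when the input exceeds the in-memory budget, while a chain of maps just streams records through on one host.

```python
import heapq
import pickle
import tempfile
from itertools import islice


def _read_run(f):
    """Yield the records of one sorted run back from its temp file."""
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return


def sort_with_spill(records, max_in_memory):
    """Sort records, spilling sorted runs to disk only when they do not fit.

    max_in_memory is a hypothetical stand-in for the available RAM,
    counted in records. Returns (sorted_list, number_of_spilled_runs).
    """
    records = iter(records)
    runs = []  # temp files, each holding one sorted run
    while True:
        chunk = sorted(islice(records, max_in_memory))
        if not chunk:
            break
        if len(chunk) < max_in_memory and not runs:
            # The whole input fit in memory: no disk I/O at all.
            return chunk, 0
        f = tempfile.TemporaryFile()
        for r in chunk:
            pickle.dump(r, f)
        f.seek(0)
        runs.append(f)
    # Merge the sorted runs lazily; only one record per run is held at a time.
    merged = list(heapq.merge(*(_read_run(f) for f in runs)))
    for f in runs:
        f.close()
    return merged, len(runs)


def pipe_maps(records, *fns):
    """Chain map functions lazily: no sort, no spill, no network transfer."""
    for fn in fns:
        records = map(fn, records)
    return records
```

For example, sort_with_spill([5, 3, 1, 4, 2], 2) has to go through disk, whereas sort_with_spill([3, 1, 2], 10) never touches it, and pipe_maps only ever holds one record at a time.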

Doug
