I did read the conclusion of the previous thread, which was that nobody 
"thought" that the performance gains would be worth the added complexity. I 
simply think that if a patch is available, the developer should be encouraged 
to submit it for review, since the topic has been discussed so frequently.

I think our concept of "the" map/reduce primitive has been limited in scope to 
the capabilities that Google described. There is no reason not to explore 
potentially beneficial additions (even if Google didn't think they were 
worthwhile).

Yes, Dryad is more confusing because it uses a more flexible primitive. 
I'm not suggesting that Hadoop should be rewritten to use a DAG at its core, 
but we do already have the o.a.h.m.jobcontrol.JobControl module, so _somebody_ 
must think the concept is useful.
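
For reference, here is a rough sketch of the kind of chaining JobControl 
already supports. The configureFirstJob/configureSecondJob helpers are 
hypothetical placeholders for whatever JobConf setup the application needs; 
in practice the second job's input path would simply point at the first 
job's output directory:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    // Each JobConf is assumed to be configured elsewhere (mapper/reducer
    // classes, input/output paths, etc.); these helpers are placeholders.
    JobConf firstConf  = configureFirstJob();
    JobConf secondConf = configureSecondJob();

    Job first  = new Job(firstConf);
    Job second = new Job(secondConf);
    second.addDependingJob(first);   // run 'second' only after 'first' succeeds

    JobControl control = new JobControl("chained-example");
    control.addJob(first);
    control.addJob(second);

    // JobControl is a Runnable; drive it from a thread and poll for completion.
    Thread runner = new Thread(control);
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(5000);
    }
    control.stop();
  }

  private static JobConf configureFirstJob()  { /* ... */ return new JobConf(); }
  private static JobConf configureSecondJob() { /* ... */ return new JobConf(); }
}

The intermediate data still lands in the filesystem between the two jobs, so 
this is only job-level chaining, not the 'piping' discussed in the poly-reduce 
thread -- but it shows the DAG-of-jobs idea is already in the codebase.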

Re: Side note: As the presenter explained, he uses a small example first to 
demonstrate the linear speedup. Then, around the 32-minute mark, he goes to an 
example of sorting 10TB on 1800 machines in ~12 minutes...
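(Back-of-the-envelope, my arithmetic rather than a figure from the talk: 10TB 
in ~720 seconds is an aggregate of roughly 14 GB/s, or on the order of 8 MB/s 
of sorted data per machine across the 1800 nodes.)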

Thanks,
Stu



-----Original Message-----
From: Owen O'Malley <[EMAIL PROTECTED]>
Sent: Friday, November 9, 2007 12:32pm
To: [email protected]
Subject: Re: Tech Talk: Dryad


On Nov 9, 2007, at 8:49 AM, Stu Hood wrote:

> Currently there is no sanctioned method of 'piping' the reduce
> output of one job directly into the map input of another (although
> it has been discussed: see the thread I linked before:
> http://www.nabble.com/Poly-reduce--tf4313116.html ).

Did you read the conclusion of the previous thread? The performance  
gains in avoiding the second map input are trivial compared to the gains  
in simplicity of having a single data path and re-execution story.  
During a reasonably large job, roughly 98% of your maps are reading  
data on the _same_ node. Once we put in rack locality, it will be  
even better.

I'd much, much rather build the map/reduce primitive and support it  
very well than add the additional complexity of any sort of  
poly-reduce. I think it is very appropriate for systems like Pig to  
include that kind of optimization, but it should not be part of the  
base framework.

I watched the front of the Dryad talk and was struck by how complex  
it quickly became. It does give the application writer a lot of  
control, but to do the equivalent of a map/reduce sort with 100k maps  
and 4k reduces with automatic spill-over to disk during the shuffle  
seemed _really_ complicated.

On a side note, in the part of the talk that I watched, the scaling  
graph went from 2 to 9 nodes. Hadoop's scaling graphs go to 1000's of  
nodes. Did they ever suggest later in the talk that it scales up higher?

-- Owen

