On Nov 9, 2007, at 8:49 AM, Stu Hood wrote:
Currently there is no sanctioned method of 'piping' the reduce output of one job directly into the map input of another (although it has been discussed: see the thread I linked before: http://www.nabble.com/Poly-reduce--tf4313116.html).
Did you read the conclusion of the previous thread? The performance gains from avoiding the second map input are trivial compared to the gains in simplicity of having a single data path and re-execution story. During a reasonably large job, roughly 98% of your maps are reading data on the _same_ node. Once we put in rack locality, it will be even better.
I'd much, much rather build the map/reduce primitive and support it very well than add the additional complexity of any sort of poly-reduce. I think it is very appropriate for systems like Pig to include that kind of optimization, but it should not be part of the base framework.
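To make the single-data-path point concrete, here is a rough sketch of chaining two jobs with the classic org.apache.hadoop.mapred API. The paths are placeholders and the identity mapper/reducer stand in for real application code, and the exact driver calls have varied across early releases, but the shape is the same: the reduce output of the first job lands in HDFS and simply becomes the map input of the second.

    // Rough sketch: two chained jobs whose only link is an HDFS directory.
    // Placeholder paths; IdentityMapper/IdentityReducer stand in for real code.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class ChainedJobs {
      public static void main(String[] args) throws Exception {
        Path input = new Path("/user/example/input");
        Path pass1Out = new Path("/user/example/pass1");
        Path output = new Path("/user/example/output");

        // Pass 1: an ordinary map/reduce job; its reduce output is written to HDFS.
        JobConf pass1 = new JobConf(ChainedJobs.class);
        pass1.setJobName("pass1");
        pass1.setMapperClass(IdentityMapper.class);
        pass1.setReducerClass(IdentityReducer.class);
        pass1.setOutputKeyClass(LongWritable.class);  // default TextInputFormat keys are offsets
        pass1.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(pass1, input);
        FileOutputFormat.setOutputPath(pass1, pass1Out);
        JobClient.runJob(pass1);                      // blocks until pass 1 finishes

        // Pass 2: its map input is simply pass 1's output directory, so the maps
        // are scheduled near the blocks they read -- the locality argument above.
        JobConf pass2 = new JobConf(ChainedJobs.class);
        pass2.setJobName("pass2");
        pass2.setMapperClass(IdentityMapper.class);
        pass2.setReducerClass(IdentityReducer.class);
        pass2.setOutputKeyClass(LongWritable.class);
        pass2.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(pass2, pass1Out);
        FileOutputFormat.setOutputPath(pass2, output);
        JobClient.runJob(pass2);
      }
    }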
I watched the first part of the Dryad talk and was struck by how quickly it became complex. It does give the application writer a lot of control, but doing the equivalent of a map/reduce sort with 100k maps and 4k reduces, with automatic spill-over to disk during the shuffle, seemed _really_ complicated.
On a side note, in the part of the talk that I watched, the scaling graph went from 2 to 9 nodes. Hadoop's scaling graphs go to thousands of nodes. Did they ever suggest later in the talk that it scales up higher?
-- Owen
