That's a great way of putting it: "increasing capacity shifts the equilibrium". If you work through some examples, you'll find that it doesn't take many iterations for a workflow to converge to its stable runtime. There is some minimum capacity you need for an equilibrium to exist, though; below that, the runtime never stabilizes.
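Here's a quick toy simulation of that convergence - just my own sketch of the recurrence we've been describing, not the exact formulas from the post; the helper function, the overhead, the capacity values, and the starting backlog are all made up for illustration:

# Toy model: data arrives at a constant rate, each run processes everything
# that arrived during the previous run, and
#
#     runtime = overhead + (hours of data in this run) / capacity
#
# where "capacity" is how many hours of incoming data the cluster can chew
# through per hour of wall-clock time. (These assumptions are mine, not the
# blog post's exact model.)

def iterate_runtime(overhead_hours, capacity, start_hours, steps=15):
    """Yield successive runtimes of the recurrence T_next = overhead + T / capacity."""
    t = start_hours
    for _ in range(steps):
        t = overhead_hours + t / capacity
        yield t

for capacity in (3.0, 1.5, 1.0):  # hours of data processed per wall-clock hour
    runtimes = list(iterate_runtime(overhead_hours=0.5, capacity=capacity,
                                    start_hours=24.0))
    print(f"capacity={capacity}: " + ", ".join(f"{t:.2f}" for t in runtimes[:8]))
    if capacity > 1.0:
        # closed-form fixed point of the recurrence: T* = overhead / (1 - 1/capacity)
        print(f"  stable runtime = {0.5 / (1 - 1 / capacity):.2f} hours")
    else:
        print("  no equilibrium: each run is longer than the last")

With capacity well above the arrival rate, the runtime settles within a handful of iterations; at capacity = 1 (processing exactly as fast as data arrives) it never does.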
Regarding the sorting, I believe the complexity in Hadoop is more like (n / R) * log(n / R), where R is the number of reducers, since the sort is spread across the reducers in parallel. And in practice, I've found that sorting doesn't take very long. But you're right - fundamentally this model is an approximation.

On Tue, Jan 5, 2010 at 5:29 PM, Yuri Pradkin <[email protected]> wrote:

> On Sunday 03 January 2010 11:30:29 Nathan Marz <[email protected]> wrote:
> > I did some analysis on the performance of Hadoop-based workflows. Some of
> > the results are counter-intuitive so I thought the community at large
> > would be interested:
> >
> > http://nathanmarz.com/blog/hadoop-mathematics/
> >
> > Would love to hear any feedback or comments you have.
>
> Just thinking out loud: runtime generally is not equal to the "hours of
> data". In your processing model it's your next iteration that will have
> this amount of data. By equating them, you're assuming an equilibrium
> case. But if you're in an equilibrium, what would increasing the capacity
> mean? I think it'd mean that you process your so-far accumulated data
> faster, so your next iteration will have less data to process, and so on
> until you get to a new equilibrium (but how fast will you get there?).
> Lesson learned: increasing capacity shifts the equilibrium. It's kind of
> like the reaction time of your system... Sometimes, I imagine, as long as
> you're keeping up with the arrival rate, you don't care whether it takes a
> week's worth or a day's. In fact there may be some constraints on the
> minimum size of the input.
>
> Another comment: you're assuming that processing time is linear in the
> input size. Of course it depends on the processing you're doing, but even
> if YOUR processing is linear, Hadoop needs to sort keys, and that is at
> best O(n*log(n)), so there is inherent non-linearity present.
>
> Similarly about the number of nodes: doubling your nodes will not double
> your service rate, for a variety of reasons.
>
> -Yuri

--
Nathan Marz
Twitter: @nathanmarz
http://nathanmarz.com
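P.S. To put rough numbers on the sort term above - the record count and reducer counts here are made up purely for illustration:

import math

# Per-reducer sort work, assuming the n keys are split evenly across R
# reducers that sort in parallel. All numbers are illustrative only.
n = 10**10  # total records in a run

print(f"single sort: ~ {n * math.log2(n):.3e} comparisons")
for R in (100, 1000):
    per_reducer = (n / R) * math.log2(n / R)
    print(f"R = {R:4d}:    ~ {per_reducer:.3e} comparisons per reducer")

For a fixed number of reducers, the per-reducer log factor grows slowly with n, which is why treating the whole thing as roughly linear hasn't been a bad approximation in practice.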
