I am looking at this many different ways. For example: shuffle sort
might run faster if we have 12 disks per node instead of 8. So shuffle
sort involves data size, disk speed, network speed, processor speed,
and number of nodes. Can we find a formula that takes these (and more
factors) into account? Once we find it, we should be able to plug in
12 or 8 and get a result close to the actual shuffle sort time.

I think it would be rather cool to have a long, drawn-out formula that
even made reference to some constants, like the time to copy data to
the distributed cache. I am looking at source data size, map
complexity, map output size, shuffle sort time, reduce complexity, and
number of nodes, and trying to arrive at a formula that will say how
long a job will take. From there we can factor in something like all
nodes having 10 gig Ethernet and watch the entire thing fall apart :)

On 3/1/10, brien colwell <[email protected]> wrote:
> Map reduce should be a constant factor improvement for the algorithm
> complexity. I think you're asking for the overhead as a function of
> input/cluster size? If your algorithm has some complexity O(f(n)), and
> you spread it over M nodes (constant), with some merge complexity less
> than f(n), the total time will still be O(f(n)).
>
> I run a small job, measure the time, and then extrapolate based on the
> big-O.
>
> On 3/1/2010 6:25 PM, Edward Capriolo wrote:
>> On Mon, Mar 1, 2010 at 4:13 PM, Darren Govoni <[email protected]> wrote:
>>> Theoretically, O(n).
>>>
>>> All other variables being equal across all nodes
>>> should...mmmmm.....reduce to n.
>>>
>>> The part that really can't be measured is the cost of Hadoop's
>>> bookkeeping chores as the data set grows, since some things in Hadoop
>>> involve synchronous/serial behavior.
>>>
>>> On Mon, 2010-03-01 at 12:27 -0500, Edward Capriolo wrote:
>>>> A previous post to core-user mentioned some formula to determine job
>>>> time. I was wondering if anyone out there is trying to tackle
>>>> designing a formula that can calculate the job run time of a
>>>> map/reduce program.
>>>> Obviously there are many variables here, including but not limited
>>>> to disk speed, network speed, processor speed, input data, many
>>>> constants, data skew, map complexity, reduce complexity, number of
>>>> nodes......
>>>>
>>>> As an intellectual challenge, has anyone started trying to write a
>>>> formula that can take into account all these factors and actually
>>>> predict a job time in minutes/hours?
>>
>> Understood, big-O notation is really not what I am looking for.
>>
>> Given all variables are the same, a Hadoop job on a finite set of
>> data should run for a finite time. There are parts of the process
>> that run linearly and parts that run in parallel, but there must be a
>> way to express how long a job actually takes (although admittedly it
>> is very involved to figure out).
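To make the kind of formula discussed above concrete, here is a back-of-the-envelope sketch in Python. Every rate, constant, and the phase breakdown itself are purely illustrative assumptions, not measurements from any real cluster; a usable model would fit these parameters from timed jobs.

```python
# Illustrative cost model for a MapReduce job. All constants below are
# made-up placeholders standing in for values you would measure.

def estimate_job_time(
    input_gb,                  # source data size in GB
    nodes,                     # number of worker nodes
    disks_per_node,            # spindles per node (the 8-vs-12 question)
    disk_mb_s=80.0,            # sequential throughput per disk, MB/s (assumed)
    net_mb_s=110.0,            # usable network bandwidth per node, MB/s (assumed)
    map_cpu_s_per_gb=30.0,     # CPU cost of the map function (assumed)
    reduce_cpu_s_per_gb=45.0,  # CPU cost of the reduce function (assumed)
    map_output_ratio=0.5,      # map output size / input size (skew ignored)
    setup_s=20.0,              # fixed overhead: JVM startup, distributed
                               # cache copy, scheduling bookkeeping, ...
):
    input_mb = input_gb * 1024.0
    shuffle_mb = input_mb * map_output_ratio

    # Phase 1: read input and run map; whichever of disk or CPU dominates.
    read_s = input_mb / (nodes * disks_per_node * disk_mb_s)
    map_s = input_gb * map_cpu_s_per_gb / nodes

    # Phase 2: shuffle/sort, paying both the network transfer and the
    # spill-to-disk cost (this is where disks_per_node shows up again).
    shuffle_s = (shuffle_mb / (nodes * net_mb_s)
                 + shuffle_mb / (nodes * disks_per_node * disk_mb_s))

    # Phase 3: reduce, then write output back to disk.
    reduce_s = (shuffle_mb / 1024.0) * reduce_cpu_s_per_gb / nodes
    write_s = shuffle_mb / (nodes * disks_per_node * disk_mb_s)

    return setup_s + max(read_s, map_s) + shuffle_s + reduce_s + write_s

# Plug in 8 vs. 12 disks per node and compare the predicted job times.
t8 = estimate_job_time(input_gb=500, nodes=20, disks_per_node=8)
t12 = estimate_job_time(input_gb=500, nodes=20, disks_per_node=12)
print(f"8 disks: {t8:.0f} s, 12 disks: {t12:.0f} s")
```

With these placeholder numbers the map CPU term dominates, so going from 8 to 12 disks only shaves the disk-bound terms; swap the assumed rates and the bottleneck moves, which is exactly the "watch it fall apart" problem.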
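Brien's measure-and-extrapolate approach can also be sketched in a few lines. The complexity function f is an assumption about the algorithm (here n log n, purely for illustration); the constant factors cancel in the ratio.

```python
import math

def extrapolate(t_small, n_small, n_big, f=lambda n: n * math.log(n)):
    # Scale a measured small-job time by the ratio of the assumed
    # complexity function at the two input sizes.
    return t_small * f(n_big) / f(n_small)

# If a 10 GB sample job took 120 s, predict the 500 GB run.
predicted = extrapolate(t_small=120.0, n_small=10, n_big=500)
```

This sidesteps modeling disks and networks entirely, but it silently assumes the cluster stays balanced at the larger size, which is the part the thread is skeptical about.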
