Predicting the run time of a MapReduce/Pig/Hive job has been addressed by folks at the University of Washington (e.g. http://www.cs.washington.edu/homes/kmorton/ICDE10.pdf) and Berkeley (e.g using the techniques from http://www.cs.berkeley.edu/~archanag/publications/ICDE09.pdf).
On Mon, Mar 1, 2010 at 4:48 PM, Edward Capriolo <[email protected]>wrote: > I am looking at this many different ways. > > For example: shuffle sort might run faster if we have 12 disks not 8 per > node. > > > So shuffle sort involves data size/ disk speed network speed/ and > processor speed/ number of nodes. > > > Can we find formula to take these (and more factors ) into account? > Once we find it we should be able to plug in 12 or 8 and get a result > close to the shuffle sort time. > > > I think it would be rather cool to have a long drawn out formula.that > even made reference to some constants, like time to copy data to > distributed cache, > > > > I am looking at source data size, map complety, map output size, > shuffle sort time, reduce complexity, number of nodes and try to > arrive at a formula that will say how long a job will take. > > From there we can factor in something like all nodes have 10 g > ethernet and watch the entire thing fall apart :) > > > > > On 3/1/10, brien colwell <[email protected]> wrote: > > Map reduce should be a constant factor improvement for the algorithm > > complexity. I think you're asking for the overhead as a function of > > input/cluster size? If your algorithm has some complexity O(f(n)), and > > you spread it over M nodes (constant), with some merge complexity less > > than f(n), the total time will still be O(f(n)). > > > > I run a small job, measure the time, and then extrapolate based on the > bigO. > > > > > > > > > > > > > > On 3/1/2010 6:25 PM, Edward Capriolo wrote: > >> On Mon, Mar 1, 2010 at 4:13 PM, Darren Govoni<[email protected]> > wrote: > >> > >>> Theoretically. O(n) > >>> > >>> All other variables being equal across all nodes > >>> should...mmmmm.....reduce to n. > >>> > >>> That part that really can't be measured is the cost of Hadoop's > >>> bookkeeping chores as the data set grows since some things in Hadoop > >>> involve synchronous/serial behavior. > >>> > >>> On Mon, 2010-03-01 at 12:27 -0500, Edward Capriolo wrote: > >>> > >>> > >>>> A previous post to core-user mentioned some formula to determine job > >>>> time. I was wondering if anyone out there is trying to tackle > >>>> designing a formula that can calculate the job run time of a > >>>> map/reduce program. Obviously there are many variables here including > >>>> but not limited to Disk Speed ,Network Speed, Processor Speed, input > >>>> data, many constants , data-skew, map complexity, reduce complexity, # > >>>> of nodes...... > >>>> > >>>> As an intellectual challenge has anyone starting trying to write a > >>>> formula that can take into account all these factors and try to > >>>> actually predict a job time in minutes/hours? > >>>> > >>> > >>> > >>> > >> Understood, BIG-0 notation is really not what I am looking for. > >> > >> Given all variables are the same, a hadoop job on a finite set of data > >> should run for a finite time. There are parts of the process that run > >> linear and parts that run in parallel, but there must be a way to > >> express how long a job actually takes (although admittedly it is very > >> involved to figure out) > >> > > > > >
