Its a Turing-class problem and thus non-deterministic by nature - a priori.
But given the uniform aspect of map/reduce an estimate could continually be approximated - as the data is processed - noting that, the farther from completion it is, the less accurate that calculation would be. And of course, once completed, the estimate is 100% accurate. In order to do it before, you would need an algorithm that can examine your map/reduce code and predict the running cost. Without data on prior runs, its not -mathematically- possible. As a function of cycle complexity over time (which is what big O is), map/reduce will scale somewhat linearly (maybe even logn) with regards to data - is my hunch. There's probably a quotient in there for the bookkeeping no one has data on yet though. But its a good inquiry. On Mon, 2010-03-01 at 18:25 -0500, Edward Capriolo wrote: > On Mon, Mar 1, 2010 at 4:13 PM, Darren Govoni <[email protected]> wrote: > > Theoretically. O(n) > > > > All other variables being equal across all nodes > > should...mmmmm.....reduce to n. > > > > That part that really can't be measured is the cost of Hadoop's > > bookkeeping chores as the data set grows since some things in Hadoop > > involve synchronous/serial behavior. > > > > On Mon, 2010-03-01 at 12:27 -0500, Edward Capriolo wrote: > > > >> A previous post to core-user mentioned some formula to determine job > >> time. I was wondering if anyone out there is trying to tackle > >> designing a formula that can calculate the job run time of a > >> map/reduce program. Obviously there are many variables here including > >> but not limited to Disk Speed ,Network Speed, Processor Speed, input > >> data, many constants , data-skew, map complexity, reduce complexity, # > >> of nodes...... > >> > >> As an intellectual challenge has anyone starting trying to write a > >> formula that can take into account all these factors and try to > >> actually predict a job time in minutes/hours? > > > > > > > > Understood, BIG-0 notation is really not what I am looking for. > > Given all variables are the same, a hadoop job on a finite set of data > should run for a finite time. There are parts of the process that run > linear and parts that run in parallel, but there must be a way to > express how long a job actually takes (although admittedly it is very > involved to figure out)
