MapReduce gives a constant-factor improvement, not a change in the algorithm's complexity. I think you're asking about the overhead as a function of input/cluster size? If your algorithm has complexity O(f(n)) and you spread it over M nodes (a constant), with merge complexity less than f(n), the total time is still O(f(n)).
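Spelling that out as a quick sanity check (a sketch; g(n) is an assumed name for the merge cost, bounded by c * f(n) for some constant c):

    T(n) \approx \frac{f(n)}{M} + g(n)
         \le \left(\frac{1}{M} + c\right) f(n)
         = O(f(n)), \quad \text{since } M \text{ is fixed.}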

I run a small job, measure the time, and then extrapolate based on the big-O.
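For what it's worth, here is a minimal sketch of that extrapolation, assuming the job behaves like O(n log n) (e.g., sort-dominated). The class name and the numbers are hypothetical; this is not a Hadoop API:

    // Minimal sketch of small-run extrapolation, assuming the job
    // behaves like O(n log n). All names and numbers are hypothetical.
    public class JobTimeExtrapolator {

        // Assumed cost model f(n) for an O(n log n) job.
        static double f(double n) {
            return n * Math.log(n);
        }

        // Scale the measured time by the ratio of the cost model:
        // T(fullN) ~= T(smallN) * f(fullN) / f(smallN)
        static double predictSeconds(double smallN, double smallSeconds,
                                     double fullN) {
            return smallSeconds * f(fullN) / f(smallN);
        }

        public static void main(String[] args) {
            // Hypothetical measurement: a 1 GB sample took 120 s.
            double predicted = predictSeconds(1e9, 120.0, 100e9);
            System.out.printf("Predicted time for 100 GB: %.0f s%n",
                              predicted);
        }
    }

The model only holds while the small run and the full run exercise the same bottleneck (disk, network, CPU); once the larger input changes which resource saturates, the ratio breaks down.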

On 3/1/2010 6:25 PM, Edward Capriolo wrote:
On Mon, Mar 1, 2010 at 4:13 PM, Darren Govoni <[email protected]> wrote:
Theoretically, O(n).

All other variables being equal across all nodes, it
should...mmmmm.....reduce to n.

The part that really can't be measured is the cost of Hadoop's
bookkeeping chores as the data set grows, since some things in Hadoop
involve synchronous/serial behavior.

On Mon, 2010-03-01 at 12:27 -0500, Edward Capriolo wrote:

A previous post to core-user mentioned some formula to determine job
time. I was wondering if anyone out there is trying to tackle
designing a formula that can calculate the run time of a
map/reduce job. Obviously there are many variables here, including
but not limited to disk speed, network speed, processor speed, input
data, many constants, data skew, map complexity, reduce complexity, #
of nodes...

As an intellectual challenge, has anyone started trying to write a
formula that takes all these factors into account and actually
predicts a job time in minutes/hours?


Understood, big-O notation is really not what I am looking for.

Given all variables are the same, a Hadoop job on a finite set of data
should run for a finite time. There are parts of the process that run
serially and parts that run in parallel, but there must be a way to
express how long a job actually takes (although admittedly it is very
involved to figure out).
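One hedged way to write that down (a sketch, not a fitted model; s(n) and p(n) are assumed names for the serial and parallelizable work on input size n, and M is the node count):

    T(n, M) = s(n) + \frac{p(n)}{M}

Job setup, scheduling, and any single-reducer merge would land in s(n); map and shuffle work that scales with the cluster lands in p(n). Measuring those two terms for a real job is the hard part.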
