I am looking at this in many different ways.

For example, shuffle sort might run faster if we have 12 disks per node instead of 8.


So shuffle sort time depends on data size, disk speed, network speed,
processor speed, and number of nodes.


Can we find a formula that takes these (and more factors) into account?
Once we find it, we should be able to plug in 12 or 8 and get a result
close to the actual shuffle sort time.
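
To make the "plug in 12 or 8" idea concrete, here is a rough sketch of what
such a formula could look like. It is a toy bottleneck model, not something
taken from Hadoop: the function name, the parameters, and the assumptions
(even data distribution, disk throughput scaling linearly with disk count,
shuffle data written once and read once) are all hypothetical.

```python
def shuffle_sort_seconds(data_bytes, nodes, disks_per_node,
                         disk_bytes_per_sec, net_bytes_per_sec,
                         cpu_bytes_per_sec):
    """Toy estimate: the slowest resource on a node sets the shuffle sort time."""
    # Assumption: data is spread evenly across nodes (no skew).
    per_node = data_bytes / nodes
    # Assumption: shuffle data is written once and read once from local disk,
    # and throughput scales linearly with the number of disks.
    disk_time = 2 * per_node / (disks_per_node * disk_bytes_per_sec)
    # Assumption: (nodes - 1) / nodes of each node's share crosses the network.
    net_time = per_node * (nodes - 1) / nodes / net_bytes_per_sec
    # Assumption: sort/merge CPU cost is a flat bytes-per-second rate.
    cpu_time = per_node / cpu_bytes_per_sec
    return max(disk_time, net_time, cpu_time)


# Compare 8 and 12 disks per node with made-up hardware numbers.
for disks in (8, 12):
    t = shuffle_sort_seconds(1e12, 10, disks, 50e6, 1.25e9, 500e6)
    print(disks, "disks:", round(t, 1), "seconds")
```

In this model the extra disks only help while disk is the bottleneck; once
the network or CPU term is the largest, going from 8 to 12 changes nothing.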


I think it would be rather cool to have a long, drawn-out formula that
even made reference to some constants, like the time to copy data to
the distributed cache.



I am looking at source data size, map complexity, map output size,
shuffle sort time, reduce complexity, and number of nodes, and trying to
arrive at a formula that will say how long a job will take.
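
As a starting point, the whole job could be written as a sum of phases plus
a fixed constant. This is only a sketch under strong assumptions: phases do
not overlap, there is no data skew, and "map complexity" and "reduce
complexity" are collapsed into simple bytes-per-second rates per node. All
names and parameters here are made up for illustration.

```python
def job_seconds(input_bytes, map_output_bytes, nodes,
                map_bytes_per_sec, shuffle_bytes_per_sec,
                reduce_bytes_per_sec, fixed_overhead_sec=0.0):
    """Toy estimate of total job time as fixed overhead plus three phases."""
    # Assumption: work is divided evenly across nodes and phases run
    # strictly one after another.
    map_time = input_bytes / (nodes * map_bytes_per_sec)
    shuffle_time = map_output_bytes / (nodes * shuffle_bytes_per_sec)
    reduce_time = map_output_bytes / (nodes * reduce_bytes_per_sec)
    # fixed_overhead_sec stands in for constants such as job setup or
    # copying data to the distributed cache.
    return fixed_overhead_sec + map_time + shuffle_time + reduce_time


# Made-up numbers: 1 TB in, 100 GB of map output, 10 nodes, 30 s of overhead.
print(job_seconds(1e12, 1e11, 10, 50e6, 25e6, 100e6, fixed_overhead_sec=30))
```

The rates would have to be measured on a real cluster; the point is only
that each factor listed above gets its own term that can be varied.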

From there we can factor in something like all nodes having 10 Gb
Ethernet and watch the entire thing fall apart :)




On 3/1/10, brien colwell <[email protected]> wrote:
> Map reduce should be a constant factor improvement for the algorithm
> complexity. I think you're asking for the overhead as a function of
> input/cluster size? If your algorithm has some complexity O(f(n)), and
> you spread it over M nodes (constant), with some merge complexity less
> than f(n), the total time will still be O(f(n)).
>
> I run a small job, measure the time, and then extrapolate based on the big-O.
>
>
>
>
>
>
> On 3/1/2010 6:25 PM, Edward Capriolo wrote:
>> On Mon, Mar 1, 2010 at 4:13 PM, Darren Govoni<[email protected]>  wrote:
>>
>>> Theoretically. O(n)
>>>
>>> All other variables being equal across all nodes
>>> should...mmmmm.....reduce to n.
>>>
>>> The part that really can't be measured is the cost of Hadoop's
>>> bookkeeping chores as the data set grows since some things in Hadoop
>>> involve synchronous/serial behavior.
>>>
>>> On Mon, 2010-03-01 at 12:27 -0500, Edward Capriolo wrote:
>>>
>>>
>>>> A previous post to core-user mentioned some formula to determine job
>>>> time. I was wondering if anyone out there is trying to tackle
>>>> designing a formula that can calculate the job run time of a
>>>> map/reduce program. Obviously there are many variables here including
>>>> but not limited to disk speed, network speed, processor speed, input
>>>> data, many constants, data skew, map complexity, reduce complexity, #
>>>> of nodes...
>>>>
>>>> As an intellectual challenge, has anyone started trying to write a
>>>> formula that can take into account all these factors and try to
>>>> actually predict a job time in minutes/hours?
>>>>
>>>
>>>
>>>
>> Understood, big-O notation is really not what I am looking for.
>>
>> Given all variables are the same, a Hadoop job on a finite set of data
>> should run for a finite time. There are parts of the process that run
>> serially and parts that run in parallel, but there must be a way to
>> express how long a job actually takes (although admittedly it is very
>> involved to figure out).
>>
>
>
