Patrick, all,

Apologies, my brain must have just frozen. There are two problems here. The
first is that my perl script already divides the time by 1000, so the times it
reports are in seconds, not milliseconds.
The second problem is more fundamental. You can add up all the task timings,
but that ignores the inherently parallel aspect: because tasks run
concurrently, the sum can be far larger than the wall-clock job time.
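
For reference, here is a rough sketch of the aggregation (in Python, not my
actual perl script, which I haven't attached). It assumes the job history file
stores one record per line as KEY="VALUE" pairs, with TASKID, TASK_TYPE,
START_TIME and FINISH_TIME in milliseconds on the Task lines, as described in
my earlier mail below:

import re
import sys
from collections import defaultdict

# Assumed record layout: one KEY="VALUE" pair list per line.
PAIR = re.compile(r'(\w+)="([^"]*)"')

def task_totals(history_path):
    # Collect start/finish timestamps (ms) and the task type, keyed by TASKID.
    starts, finishes, types = {}, {}, {}
    with open(history_path) as f:
        for line in f:
            if not line.startswith("Task "):
                continue
            kv = dict(PAIR.findall(line))
            tid = kv.get("TASKID")
            if not tid:
                continue
            if "TASK_TYPE" in kv:
                types[tid] = kv["TASK_TYPE"].upper()
            if "START_TIME" in kv:
                starts[tid] = int(kv["START_TIME"])
            if "FINISH_TIME" in kv:
                finishes[tid] = int(kv["FINISH_TIME"])

    # Sum FINISH_TIME - START_TIME per task type, converting ms -> s.
    totals = defaultdict(float)
    for tid, t0 in starts.items():
        if tid in finishes:
            totals[types.get(tid, "OTHER")] += (finishes[tid] - t0) / 1000.0
    return totals

if __name__ == "__main__":
    totals = task_totals(sys.argv[1])
    print("total map time    (s): %.1f" % totals.get("MAP", 0.0))
    print("total reduce time (s): %.1f" % totals.get("REDUCE", 0.0))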

Here is the updated table.


256  405420   751376   3483  332.1263279  1.297368468
256  411574   711841   3363  334.0514422  1.304888446
256  491519   599081   2955  369.0693739  1.441677242
512  491034  1212229   2989  569.8437605  1.112976095
512  471421   947025   2305  615.3778742  1.201909911
512  841000  1932633   4473  620.0833892  1.21110037
Column 1 is the number of nodes in the cluster.
Column 2 is the total map time in seconds.
Column 3 is the total reduce time in seconds.
Column 4 is the job time in seconds.
Column 5 is (Column 2 + Column 3)/Column 4.
Column 6 is Column 5 divided by the number of nodes (Column 1); a small snippet recomputing these for the first row follows below.
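
As a quick sanity check, here is a small illustrative snippet (not part of my
script) that recomputes columns 5 and 6 from columns 1 through 4, using the
first 256-node row:

# Illustrative only: recompute column 5 and column 6 from columns 1-4.
def ratios(nodes, map_s, reduce_s, job_s):
    col5 = (map_s + reduce_s) / float(job_s)  # total task seconds per second of job wall-clock time
    col6 = col5 / nodes                       # the same ratio, per node
    return col5, col6

# First 256-node row: (405420 + 751376) / 3483 = 332.13, and 332.13 / 256 = 1.297
print(ratios(256, 405420, 751376, 3483))

For that row it prints roughly (332.13, 1.297), matching columns 5 and 6 in the table.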

From this it appears that the infrastructure gives me around a 20-40% advantage
over single-node serial execution (Column 6 comes out to roughly 1.2-1.4). This
does not include the other advantages such as failure tolerance and data
availability.

Questions:

1. Is this the norm?
2. Can I tune the system to get better numbers?
3. Are there others who can shed some thoughts on these or share data?
4. Any idea about the loss of efficiency due to the infrastructure?
5. The 256-node cluster seems to provide an efficiency of about 1.3, while at
512 nodes it decreases to around 1.2. Would this trend continue?


Raj




From: Patrick Angeles <[email protected]>
To: [email protected]
Cc: "[email protected]" <[email protected]>
Sent: Tuesday, February 1, 2011 7:24 PM
Subject: Re: Hadoop Framework questions.

Hi, Raj.
Interesting analysis... 

These numbers appear to be off. For example, 405s for mappers + 751s for 
reducers = 1156s for all tasks. If you have 2000 map and reduce tasks, this 
means each task is spending roughly 500ms to do actual work. That is a very low 
number and seems impossible.

- P

On Wed, Feb 2, 2011 at 3:07 AM, Raj V <[email protected]> wrote: 
Hi,

I have been running some benchmarks with some hadoop jobs on different node and
disk configurations to see what is a good configuration to get optimum
performance. Here are some results that I have. Using the hadoop job log, I
added up the timings for each of the map and reduce tasks and converted them to
seconds by dividing by 1000. The job is terasort as provided in examples.jar.

Column 1 is the number of nodes in the cluster.
Column 2 is the sum of all the map task times, i.e. the sum over all TaskIDs of
(FINISH_TIME - START_TIME) where TASK_TYPE=Map.
Column 3 is the same calculation for all the reduce tasks. In all cases all the
tasks completed successfully.
Column 4 is FINISH_TIME - LAUNCH_TIME for the job. There is a SETUP_TASK and a
CLEANUP_TASK that take insignificant time.
Column 5 is the proportion of time taken by the tasks = (Col2 + Col3)/Col4.

And I am assuming that Total_Time - (Map_Time + Reduce_Time) is basically the
framework time. Now looking at the data I see that the framework is taking
~38%-68% of the time.

Here are my questions.
1. Is this a problem with my setup or is this the normal behaviour?
2. What can I do to reduce the time taken for everything else?
3. Does this mean that, to get a reasonably efficient hadoop cluster (10%-20%
time for framework), I need to get to a 1000-node cluster?
4. What is the normal "number" for the framework time?

I apologize for:
1. Cross posting. I am using CDH3B3 but I don't think my questions are specific
to CDH3B3.
2. Not having provided all the details of the system, network, and disk
configuration.
3. The different jobs having different disk configurations. Most of them are
running about 2000 map and reduce jobs.
>256  405.42   751.376   3483  0.332126328
>256  411.574  711.841   3363  0.334051442
>256  491.519  599.081   2955  0.369069374
>512  491.034  1212.229  2989  0.56984376
>512  471.421  947.025   2305  0.615377874
>512  841      1932.633  4473  0.620083389
