Hi, I have been running some benchmarks with Hadoop jobs on different node and disk configurations to see which configuration gives the best performance.
Here are some results that I have. Using the Hadoop job log, I added up the timings for each of the map and reduce tasks and converted them to seconds by dividing by 1000. The job is terasort as provided in examples.jar.

Column 1 is the number of nodes in the cluster.
Column 2 is the sum of the times for all map tasks, i.e. the sum over all TaskIDs of (FINISH_TIME - START_TIME) where TASK_TYPE=MAP.
Column 3 is the same calculation for all the reduce tasks. In all cases, all tasks completed successfully.
Column 4 is FINISH_TIME - LAUNCH_TIME for the job. There are also a SETUP_TASK and a CLEANUP_TASK, but they take insignificant time.
Column 5 is the proportion of time taken by the tasks, i.e. (Col2 + Col3) / Col4.

I am assuming that Total_Time - (Map_Time + Reduce_Time) is essentially the framework time. Looking at the data, I see that the framework is taking ~38%-68% of the total time.

Here are my questions:
1. Is this a problem with my setup, or is this normal behaviour?
2. What can I do to reduce the time taken by everything else?
3. Does this mean that to get a reasonably efficient Hadoop cluster (10%-20% of time spent in the framework), I need to go to a 1000-node cluster?
4. What is a normal "number" for the framework time?

I apologize for:
1. Cross-posting. I am using CDH3B3, but I don't think my questions are specific to CDH3B3.
2. Not having provided all the details of the system, network, and disk configuration.
3. The different jobs having different disk configurations. Most of them run about 2000 map and reduce tasks.

Nodes  Map time (s)  Reduce time (s)  Total time (s)  Task proportion
256    405.42        751.376          3483            0.332126328
256    411.574       711.841          3363            0.334051442
256    491.519       599.081          2955            0.369069374
512    491.034       1212.229         2989            0.56984376
512    471.421       947.025          2305            0.615377874
512    841           1932.633         4473            0.620083389
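To be concrete about how the columns are derived, here is a minimal sketch of the calculation in Python. It is not my actual script: it assumes the per-task records have already been extracted from the job history log into (task_type, start_ms, finish_ms) tuples, and the function name and input format are just for illustration.

# Sketch of the per-job summary described above.
# Assumes task records are already extracted from the job history log
# as (task_type, start_ms, finish_ms) tuples; parsing the actual
# CDH3B3 history format is not shown here.

def summarize(tasks, job_launch_ms, job_finish_ms):
    """Return (map_s, reduce_s, total_s, task_fraction, framework_fraction)."""
    # Sum FINISH_TIME - START_TIME over all map tasks, in seconds (Col 2).
    map_s = sum((f - s) / 1000.0 for t, s, f in tasks if t == "MAP")
    # Same sum over all reduce tasks (Col 3).
    reduce_s = sum((f - s) / 1000.0 for t, s, f in tasks if t == "REDUCE")
    # Wall-clock job time from launch to finish (Col 4).
    total_s = (job_finish_ms - job_launch_ms) / 1000.0
    # Proportion of time spent in tasks (Col 5); the rest is "framework" time.
    task_fraction = (map_s + reduce_s) / total_s
    framework_fraction = 1.0 - task_fraction
    return map_s, reduce_s, total_s, task_fraction, framework_fraction

# Example with made-up numbers:
# print(summarize([("MAP", 0, 120000), ("REDUCE", 60000, 300000)], 0, 400000))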
