[jira] Commented: (HADOOP-4396) sort on 400 nodes is now slower than in 18

Jothi Padmanabhan (JIRA) Tue, 14 Oct 2008 10:38:40 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639501#action_12639501
 ]


Jothi Padmanabhan commented on HADOOP-4396:
-------------------------------------------

Devaraj and I did a deep dive into this and found the following:

1. On average, the map tasks complete in less than a minute. However, we 
observed a few stragglers at the end of the run that take an unduly long 
time to complete (~15 minutes), and these are primarily responsible for the 
increased overall run time. Most of these stragglers are data-local tasks. 
A few other tasks took about 4-5 minutes, but that is expected towards the 
end of the run, so they are not suspects.
2. The task logs for these tasks indicated that the actual map function (up to 
the beginning of the first spill) took about 14 minutes, while the sort, spill 
and merge parts together took less than a minute.
3. The data node log indicated that the map task first contacted the data node 
as soon as the job started, but received its final set of data only after 14 
minutes.
4. Most of these tasks ran on a few specific nodes. For example, in one run, 4 
of these ran on node x, 3 ran on node y. 
5. However, nodes x and y themselves do not have any problems. On the next 
invocation of sort (on the same cluster, which was not reallocated), the 
problem nodes were a different pair, x' and y', and all the tasks on x and y 
ran fine.
6. While these straggler tasks were running, the task tracker appeared to be 
busy handling shuffles.
7. The nodes where these tasks were running did not show any unduly high CPU 
or memory usage.
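
For anyone who wants to repeat the analysis in points 1 and 4 on their own 
task timings, here is a minimal sketch. It assumes the per-task durations 
have already been pulled out of the logs into (task id, node, seconds) 
tuples. The records, the node names and the 10x-median cutoff below are 
all made up for illustration and are not taken from the actual run.

```python
from collections import Counter
from statistics import median

# Hypothetical (task_id, node, duration_seconds) records, for illustration only.
tasks = [
    ("m_000001", "node-a", 48),
    ("m_000002", "node-b", 52),
    ("m_000003", "node-c", 55),
    ("m_000004", "node-d", 50),
    ("m_000005", "node-a", 47),
    ("m_000006", "node-b", 53),
    ("m_000007", "node-c", 49),
    ("m_000008", "node-d", 51),
    ("m_000009", "node-c", 270),  # slow, but in the "expected" 4-5 minute range
    ("m_000010", "node-x", 900),
    ("m_000011", "node-x", 870),
    ("m_000012", "node-y", 880),
]


def find_stragglers(records, factor=10.0):
    """Return the records whose duration exceeds factor * median duration.

    The factor is an arbitrary assumption; tune it for the workload.
    """
    cutoff = factor * median(duration for _, _, duration in records)
    return [record for record in records if record[2] > cutoff]


def stragglers_per_node(records, factor=10.0):
    """Count stragglers per node to see whether they cluster on a few hosts."""
    return Counter(node for _, node, _ in find_stragglers(records, factor))


if __name__ == "__main__":
    for node, count in stragglers_per_node(tasks).most_common():
        print(node, count)
```

Comparing the per-node counts across consecutive runs would show whether the 
slow nodes move around, as described in point 5.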

Given that the 3514 patch affects the sort, spill and merge parts of the map 
task and not the functionality before them, it appears that the bug is more 
likely a side effect of the patch. 
One change in this patch that could possibly be causing this side effect is the 
switch from LocalFileSystem to RawLocalFileSystem for the creation and handling 
of the intermediate files. However, it is not very clear how this change would 
affect data node performance. Thoughts?



> sort on 400 nodes is now slower than in 18
> ------------------------------------------
>
>                 Key: HADOOP-4396
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4396
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Jothi Padmanabhan
>            Assignee: Jothi Padmanabhan
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> Sort on 400 nodes on Hadoop release 18 takes about 29 minutes, but on the 
> 19 branch it takes about 32 minutes. This behavior is consistent.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
