[
https://issues.apache.org/jira/browse/HADOOP-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639501#action_12639501
]
Jothi Padmanabhan commented on HADOOP-4396:
-------------------------------------------
Devaraj and I did a deep dive into this and found the following:
1. On average, the map tasks complete in less than a minute. However, we
observed a few stragglers at the end of the run that take an unduly long time
to complete (~15 minutes); these are primarily responsible for the increased
overall run time. Most of these tasks are data-local tasks. A few other tasks
took about 4-5 minutes, but those are expected towards the end of the run and
are not the suspects.
2. The task logs for these tasks indicated that the actual map function (up to
the beginning of the first spill) took about 14 minutes, while the sort, spill
and merge parts together took less than a minute.
3. The data node log indicated that the map task first contacted the data node
as soon as the job started, but the map task got its final data set only after
14 minutes.
4. Most of these tasks ran on a few specific nodes. For example, in one run, 4
of these ran on node x, 3 ran on node y.
5. However, the specific nodes x and y themselves do not have any problems. On
the next invocation of sort (on the same cluster, without reallocation), the
problem nodes were a different pair, x' and y', and all the tasks on x and y
ran fine.
6. While these straggler tasks were running, the task tracker appeared to be
busy handling shuffles.
7. The nodes where these tasks were running did not show any unduly high CPU
or memory usage.
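
For anyone who wants to repeat this kind of check, here is a rough sketch of
how stragglers could be flagged and grouped by node. It is not the tooling we
used: the task records, host names and the 10x-median cutoff are all made-up
assumptions for illustration, and real durations would have to be pulled from
the job history or task logs.

```python
from collections import Counter
from statistics import median

# Hypothetical (task_id, host, duration_seconds) records; illustrative only,
# not taken from the actual job logs.
tasks = [
    ("m_000001", "node-a", 48),
    ("m_000002", "node-b", 52),
    ("m_000003", "node-x", 900),
    ("m_000004", "node-x", 870),
    ("m_000005", "node-y", 910),
    ("m_000006", "node-c", 270),
    ("m_000007", "node-a", 55),
    ("m_000008", "node-b", 50),
    ("m_000009", "node-c", 47),
]


def find_stragglers(records, factor=10.0):
    """Return records whose duration exceeds `factor` times the median.

    The factor is an assumed cutoff, chosen so that 4-5 minute tasks are
    not flagged while ~15 minute tasks are.
    """
    cutoff = factor * median(d for _, _, d in records)
    return [r for r in records if r[2] > cutoff]


def stragglers_per_host(records, factor=10.0):
    """Count stragglers per host, to see whether they cluster on a few nodes."""
    return Counter(host for _, host, _ in find_stragglers(records, factor))


if __name__ == "__main__":
    for host, count in stragglers_per_host(tasks).most_common():
        print(host, count)
```

Running the same grouping on two consecutive sort runs would show whether the
slow hosts stay the same or move, as in point 5.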
Given that the 3514 patch affects the sort, spill and merge parts of the map
task and not the functionality before them, it appears that the bug is more
likely a side effect of the patch.
One change in this patch that could possibly be causing this side effect is the
switch to RawLocalFileSystem instead of LocalFileSystem for the creation and
handling of the intermediate files. However, it is not very clear how this
change would affect data node performance. Thoughts?
> sort on 400 nodes is now slower than in 18
> ------------------------------------------
>
> Key: HADOOP-4396
> URL: https://issues.apache.org/jira/browse/HADOOP-4396
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.19.0
> Reporter: Jothi Padmanabhan
> Assignee: Jothi Padmanabhan
> Priority: Blocker
> Fix For: 0.19.0
>
>
> Sort on 400 nodes on hadoop release 18 takes about 29 minutes, but with the
> 19 branch takes about 32 minutes. This behavior is consistent.