[ https://issues.apache.org/jira/browse/HADOOP-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528691 ]

Christian Kunz commented on HADOOP-1874:
----------------------------------------

I noticed that the namenode was rather disk-busy and therefore turned off most of 
the logging, but the disk load comes from the edit log and its sync'ing:

Device:    rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda          0.00  225.12   0.00  530.62    0.00  6047.25   0.00  3023.63     11.40      0.35   0.66   0.66  35.17
sdb          0.00    0.00   0.00    0.00    0.00     0.00   0.00     0.00      0.00      0.00   0.00   0.00   0.00
sdc          0.00    0.00   0.00    0.00    0.00     0.00   0.00     0.00      0.00      0.00   0.00   0.00   0.00
sdd          0.00  224.79   0.00  530.12    0.00  6037.94   0.00  3018.97     11.39      0.59   1.11   1.11  58.84

avg-cpu:  %user   %nice    %sys %iowait   %idle
           7.13    0.00    2.67   10.05   80.16

As a matter of fact, I changed the namenode configuration to journal to only a 
single disk (instead of two as before) and the number of timeouts decreased a 
lot, although the namenode was still disk-bound (%util > 90%).

So I would conclude that in the short term the namenode journal fsync rate 
should be reduced (by batching up namenode operations in an atomic fashion?).
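
To make the idea concrete, below is a minimal sketch of such batching ("group 
commit"): each handler thread appends its edit to an in-memory buffer, and 
whichever thread syncs first covers everything buffered so far, so concurrent 
namenode operations share a single fsync. This is only an illustration, not the 
actual FSEditLog code; the class and method names (BatchedEditLog, logEdit, 
logSync) and the single-buffer design are my own assumptions, and a production 
version would double-buffer so that edits arriving during a sync never race 
with the file descriptor being sync'ed.

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.IOException;

public class BatchedEditLog {
    private final DataOutputStream out;   // buffered edit stream
    private final FileDescriptor fd;      // used to force the batch to disk
    private long lastWrittenTxId = 0;     // highest txid appended to the buffer
    private long lastSyncedTxId = 0;      // highest txid known to be on disk
    private boolean syncInProgress = false;

    public BatchedEditLog(File editsFile) throws IOException {
        FileOutputStream fos = new FileOutputStream(editsFile, true);
        this.fd = fos.getFD();
        this.out = new DataOutputStream(new BufferedOutputStream(fos, 512 * 1024));
    }

    // Append one edit record to the in-memory buffer and return its txid.
    public synchronized long logEdit(byte[] record) throws IOException {
        out.write(record);
        return ++lastWrittenTxId;
    }

    // Block until the given txid is durable. Threads that arrive while a sync
    // is in flight wait for it; if that sync already covered their txid they
    // return without issuing another fsync -- that is the batching.
    public void logSync(long txid) throws IOException, InterruptedException {
        long syncUpTo;
        synchronized (this) {
            while (syncInProgress) {
                wait();
            }
            if (txid <= lastSyncedTxId) {
                return;                    // an earlier batch already covered us
            }
            syncInProgress = true;
            syncUpTo = lastWrittenTxId;    // everything logged so far joins this batch
        }
        try {
            synchronized (this) {
                out.flush();               // hand the buffered edits to the OS
            }
            fd.sync();                     // one disk sync for the whole batch
            synchronized (this) {
                lastSyncedTxId = Math.max(lastSyncedTxId, syncUpTo);
            }
        } finally {
            synchronized (this) {
                syncInProgress = false;
                notifyAll();               // release the threads we just covered
            }
        }
    }

    // Tiny demo: eight threads each log one edit and ask for durability;
    // under load most of them end up sharing a single fsync.
    public static void main(String[] args) throws Exception {
        final BatchedEditLog log = new BatchedEditLog(File.createTempFile("edits", ".log"));
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    try {
                        long txid = log.logEdit("OP_MKDIR /some/path\n".getBytes());
                        log.logSync(txid);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }
}

With something like this, N concurrent client operations cost roughly one fsync 
per batch instead of one fsync each, which should directly lower the w/s and 
%util numbers shown above.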



> lost task trackers -- jobs hang
> -------------------------------
>
>                 Key: HADOOP-1874
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1874
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.15.0
>            Reporter: Christian Kunz
>            Assignee: Devaraj Das
>            Priority: Blocker
>         Attachments: lazy-dfs-ops.1.patch, lazy-dfs-ops.2.patch, 
> lazy-dfs-ops.4.patch, lazy-dfs-ops.patch, server-throttle-hack.patch
>
>
> This happens on a 1400 node cluster using a recent nightly build patched with 
> HADOOP-1763 (that fixes a previous 'lost task tracker' issue) running a 
> c++-pipes job with 4200 maps and 2800 reduces. The task trackers start to get 
> lost in high numbers at the end of job completion.
> Similar non-pipes jobs do not show the same problem, but it is unclear whether 
> this is related to c++-pipes. It could also be dfs overload when reduce tasks 
> close and validate all newly created dfs files. I see dfs client rpc timeout 
> exceptions, but that alone does not explain the escalation in lost task 
> trackers.
> I also noticed that the job tracker becomes rather unresponsive, with rpc 
> timeout and call queue overflow exceptions. The job tracker is running with 60 
> handlers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
