[ https://issues.apache.org/jira/browse/HADOOP-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528625 ]

Raghu Angadi commented on HADOOP-1874:
--------------------------------------

hmm.. I was wondering if the following would happen when I was submitting the 
server throttle hack (looks like it does :( ):

- The server gets backed up at one moment, and it reads more slowly from the client.
- It looks like if the server does not receive anything from a client for 2 min, 
it closes the connection. I am not sure when the client closes the 
connection on its side.
- Ideally, what the server should do in that case is not process _any_ more RPCs 
from that connection. But since there is still readable data on the closed 
socket, it patiently reads and executes RPCs whose results are going to be thrown 
away. Any such unnecessary work results in a bad feedback loop of ever-increasing 
load, since the client retries the same RPCs on a different socket (see the 
sketch below). I wonder what 'netstat' would have shown in this case on the namenode. 
My guess is that there should be a LOT of these exceptions while writing the reply.

Let me know if you want to try an updated server throttle patch.


> lost task trackers -- jobs hang
> -------------------------------
>
>                 Key: HADOOP-1874
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1874
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.15.0
>            Reporter: Christian Kunz
>            Assignee: Devaraj Das
>            Priority: Blocker
>         Attachments: lazy-dfs-ops.1.patch, lazy-dfs-ops.2.patch, 
> lazy-dfs-ops.4.patch, lazy-dfs-ops.patch, server-throttle-hack.patch
>
>
> This happens on a 1400-node cluster using a recent nightly build patched with 
> HADOOP-1763 (which fixes a previous 'lost task tracker' issue) running a 
> c++-pipes job with 4200 maps and 2800 reduces. The task trackers start to get 
> lost in high numbers toward the end of the job.
> Similar non-pipes jobs do not show the same problem, but it is unclear whether 
> this is related to c++-pipes. It could also be dfs overload when reduce tasks 
> close and validate all newly created dfs files. I see dfs client rpc timeout 
> exceptions. But this alone does not explain the escalation in losing task 
> trackers.
> I also noticed that the job tracker becomes rather unresponsive, with rpc 
> timeout and call queue overflow exceptions. The Job Tracker is running with 60 
> handlers.
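
As a rough, illustrative model of the 'call queue overflow' errors mentioned 
above: when timed-out clients re-send the same RPCs, the offered load can exceed 
a bounded call queue no matter how many handlers drain it. This is not the 
JobTracker/IPC code; the queue bound of handlerCount * 100 is an assumption made 
only for this sketch:

    // Illustrative only: models a bounded call queue overflowing under a retry burst.
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class CallQueueOverflowDemo {
      public static void main(String[] args) {
        int handlerCount = 60;                  // JobTracker handler count from the report
        int maxQueueSize = handlerCount * 100;  // assumed per-handler queue budget
        BlockingQueue<String> callQueue = new ArrayBlockingQueue<>(maxQueueSize);

        // Simulate a burst: every client whose RPC timed out re-sends the same
        // call on a new connection, so offers outrun what the queue can hold.
        int offered = 10_000;
        int dropped = 0;
        for (int i = 0; i < offered; i++) {
          if (!callQueue.offer("rpc-" + i)) {
            dropped++;                          // the server would log a queue-overflow error here
          }
        }
        System.out.println("queued=" + callQueue.size() + " dropped=" + dropped);
      }
    }
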

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.