[ 
https://issues.apache.org/jira/browse/HADOOP-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511490
 ] 

Devaraj Das commented on HADOOP-1586:
-------------------------------------

I discovered an issue with the patch. I am removing it and will submit another soon.

> Progress reporting thread can afford to be slightly lenient towards 
> exceptions other than ConnectException
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1586
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.0
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.14.0
>
>
> Currently, in the loop of Task.startCommunicationThread, MAX_RETRIES (set to 
> three) attempts are made to report progress/ping 
> (TaskUmbilicalProtocol.progress or TaskUmbilicalProtocol.ping). All attempt 
> failures are counted as critical. Here I am proposing a variant: treat only 
> ConnectException as critical and treat the others as 
> non-critical. The other exception could be the SocketTimeoutException in the 
> case of the two RPCs. 
> The reason why I am proposing this is that since HADOOP-1462 went in, I have 
> been seeing quite a few unexpected 65 deaths, and with some logging it 
> appears that they happen, most of the time, due to the SocketTimeoutException 
> in the progress RPC call (before HADOOP-1462, the return value of progress 
> would not be checked). And when the hack described above was put in, things 
> improved considerably. 
> One argument that one might make against the above proposal is that the 
> tasktracker could be faulty, when a task is not able to successfully invoke 
> an RPC on it even though it is able to connect. If this is indeed the case, 
> even in the current scheme of things, the only resort is to restart the 
> tasktracker (either manually, or, the JobTracker asks it to reinitialize), 
> and in both cases, normal behavior of the protocol will ensure that the 
> child task will die (since the reinited tasktracker is going to return false 
> for the progress/ping calls).
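
For illustration only, the proposed policy could be sketched roughly as below. This is not the actual patch and not the code in Task.startCommunicationThread: the Umbilical interface, the Outcome enum, the run method and its bounded iteration count are all hypothetical stand-ins, and resetting the retry budget after a successful call is an assumption. Only ConnectException consumes a retry; any other IOException (for example SocketTimeoutException) is tolerated.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

class LenientProgressReporter {
    /** Hypothetical stand-in for the progress/ping RPC; not the real TaskUmbilicalProtocol. */
    interface Umbilical {
        boolean progress() throws IOException;
    }

    /** Hypothetical result type used only by this sketch. */
    enum Outcome { STILL_RUNNING, PARENT_REJECTED, GAVE_UP }

    static final int MAX_RETRIES = 3;

    /**
     * Runs a bounded number of reporting iterations (the real thread would loop
     * until the task finishes). Only ConnectException counts against the budget.
     */
    static Outcome run(Umbilical umbilical, int iterations) {
        int remainingRetries = MAX_RETRIES;
        for (int i = 0; i < iterations; i++) {
            try {
                if (!umbilical.progress()) {
                    // The tasktracker no longer knows this task: the child should die.
                    return Outcome.PARENT_REJECTED;
                }
                // Assumption: a successful call restores the full retry budget.
                remainingRetries = MAX_RETRIES;
            } catch (ConnectException e) {
                // Critical: the tasktracker cannot be reached at all.
                remainingRetries--;
                if (remainingRetries == 0) {
                    return Outcome.GAVE_UP;
                }
            } catch (IOException e) {
                // Non-critical (e.g. SocketTimeoutException): try again next round
                // without consuming a retry.
            }
        }
        return Outcome.STILL_RUNNING;
    }
}
```

Under these assumptions, an umbilical that always times out keeps the task alive, while one that always refuses the connection makes the task give up after MAX_RETRIES attempts.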

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
