[ https://issues.apache.org/jira/browse/HADOOP-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Devaraj Das updated HADOOP-1586:
--------------------------------

    Attachment:     (was: 1586.patch)

> Progress reporting thread can afford to be slightly lenient towards exceptions other than ConnectException
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1586
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.0
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.14.0
>
>
> Currently, in the loop of Task.startCommunicationThread, MAX_RETRIES (set to
> three) attempts are made to report progress or ping
> (TaskUmbilicalProtocol.progress or TaskUmbilicalProtocol.ping). Every failed
> attempt is counted as critical. Here I am proposing a variant: treat only
> ConnectException as critical and treat all other exceptions as non-critical.
> For these two RPCs, the other exception is typically a
> SocketTimeoutException.
> The reason I am proposing this is that since HADOOP-1462 went in, I have
> been seeing quite a few unexpected 65 deaths, and with some logging it
> appears that most of them are caused by a SocketTimeoutException in the
> progress RPC call (before HADOOP-1462, the return value of progress was not
> checked). When the hack described above was put in, things improved
> considerably.
> One argument against this proposal is that the tasktracker could be faulty
> when a task is able to connect to it but cannot successfully invoke an RPC
> on it. If that is indeed the case, then even in the current scheme of
> things the only resort is to restart the tasktracker (either manually, or
> because the JobTracker asks it to reinitialize). In both cases, the normal
> behavior of the protocol ensures that the child task dies, since the
> reinitialized tasktracker is going to return false for the progress/ping
> calls.
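
The proposed policy can be sketched roughly as follows. This is not the attached patch and not the code in Task.startCommunicationThread; the class LenientProgressReporter, the Umbilical interface, and reportOnce are hypothetical names used only for illustration. Resetting the retry budget after a successful call is also an assumption of this sketch.

```java
import java.io.IOException;
import java.net.ConnectException;

// Illustrative sketch only: names below are hypothetical, not Hadoop APIs.
class LenientProgressReporter {

    /** Hypothetical stand-in for a TaskUmbilicalProtocol.progress/ping call. */
    interface Umbilical {
        boolean report() throws IOException;
    }

    static final int MAX_RETRIES = 3;

    private int remainingRetries = MAX_RETRIES;

    /**
     * Performs one reporting attempt.
     *
     * @return true if the task should keep running, false if it should exit
     */
    boolean reportOnce(Umbilical umbilical) {
        try {
            if (!umbilical.report()) {
                // The tasktracker no longer knows this task (for example,
                // after a reinitialization), so the child should die.
                return false;
            }
            // Assumption: a successful call restores the full retry budget.
            remainingRetries = MAX_RETRIES;
            return true;
        } catch (ConnectException e) {
            // Critical: the tasktracker could not be reached at all.
            remainingRetries--;
            return remainingRetries > 0;
        } catch (IOException e) {
            // Non-critical (for example SocketTimeoutException): log it in a
            // real implementation and try again on the next interval without
            // consuming a retry.
            return true;
        }
    }
}
```

With this policy, any number of timeouts leaves the task alive, while three consecutive ConnectException failures still make reportOnce return false, at which point the caller would terminate the task.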