[ https://issues.apache.org/jira/browse/HADOOP-3813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Amareshwari Sriramadasu updated HADOOP-3813:
--------------------------------------------

    Attachment: patch-3813-0.17.txt

Here is a patch for branch-0.17. The earlier patch applies to both trunk and branch-0.18.

> RPC queue overload of JobTracker
> --------------------------------
>
>                 Key: HADOOP-3813
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3813
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>         Attachments: patch-3813-0.17.txt, patch-3813.txt
>
>
> On a cluster with about 1700 nodes, when a job with about 100,000 maps and
> 10,000 reduces completed, the JobTracker, even with 80 handlers, could not
> handle the rpc call load during promotion of the job, such that at the end,
> because of the discarded heartbeats, the JobTracker lost nearly all
> TaskTrackers (about 10 TaskTrackers left). Promotion took more than 40
> minutes.
> They reconnected and everything recovered, but this might have been just luck.
> Shouldn't there be an adaptive throttling of the rate in heartbeats and
> TaskCompletionEvents?
> Sample messages:
> 2008-07-22 18:21:55,831 WARN org.apache.hadoop.ipc.Server: Call queue
> overflow discarding oldest call heartbeat([EMAIL PROTECTED], false, true,
> 18137) from xxx
> 2008-07-22 18:21:55,834 WARN org.apache.hadoop.ipc.Server: Call queue overflow
> discarding oldest call getTaskCompletionEvents(job_200807190635_0012, 119567,
> 50) from yyy
> ...
> 2008-07-22 19:02:28,821 WARN org.apache.hadoop.ipc.Server: IPC Server handler
> 1 on 9020, call heartbeat([EMAIL PROTECTED], false, true, 18199) from zzz:
> discarded for being too old (40936)
> 2008-07-22 19:02:28,821 WARN org.apache.hadoop.ipc.Server: IPC Server handler
> 34 on 9020, call getTaskCompletionEvents(job_200807190635_0012, 119567, 50)
> from uuu: discarded for being too old (40978)

-- 
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
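
For readers wondering what the "adaptive throttling" asked about in the issue description might look like, here is a minimal sketch. It is not taken from patch-3813.txt or patch-3813-0.17.txt, and it is not a Hadoop API: the class name `HeartbeatThrottle`, the method `nextIntervalMs`, and the 25%/75% load thresholds are all hypothetical. The idea is that the server doubles the heartbeat interval it hands out while its RPC call queue is mostly full and halves it again once the queue drains.

```java
// Hypothetical sketch only; not the HADOOP-3813 patch and not part of Hadoop.
final class HeartbeatThrottle {
    private final long minIntervalMs;
    private final long maxIntervalMs;
    private long currentIntervalMs;

    HeartbeatThrottle(long minIntervalMs, long maxIntervalMs) {
        if (minIntervalMs <= 0 || maxIntervalMs < minIntervalMs) {
            throw new IllegalArgumentException("need 0 < min <= max");
        }
        this.minIntervalMs = minIntervalMs;
        this.maxIntervalMs = maxIntervalMs;
        this.currentIntervalMs = minIntervalMs;
    }

    /**
     * Returns the interval clients should wait before their next heartbeat,
     * given how full the server's call queue currently is.
     */
    long nextIntervalMs(int queuedCalls, int queueCapacity) {
        if (queueCapacity <= 0 || queuedCalls < 0) {
            throw new IllegalArgumentException("invalid queue figures");
        }
        double load = (double) queuedCalls / queueCapacity;
        if (load > 0.75) {
            // Queue is close to overflowing: back off, up to the ceiling.
            currentIntervalMs = Math.min(maxIntervalMs, currentIntervalMs * 2);
        } else if (load < 0.25) {
            // Queue has drained: speed heartbeats back up, down to the floor.
            currentIntervalMs = Math.max(minIntervalMs, currentIntervalMs / 2);
        }
        // Between the two thresholds the interval is left unchanged.
        return currentIntervalMs;
    }
}
```

In such a design the server would return the computed interval with each heartbeat response, so that all clients slow down together instead of having their calls discarded. Whether that suits the JobTracker is a separate question that this sketch does not settle.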