[ https://issues.apache.org/jira/browse/HADOOP-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Devaraj Das updated HADOOP-1763: -------------------------------- Assignee: Devaraj Das Status: Patch Available (was: Open) > Too many lost task trackers - Job failures > ------------------------------------------ > > Key: HADOOP-1763 > URL: https://issues.apache.org/jira/browse/HADOOP-1763 > Project: Hadoop > Issue Type: Bug > Components: mapred > Environment: Version: 0.15.0-dev, r565628 > Compiled: Tue Aug 14 20:55:37 UTC 2007 by hadoopqa > 1400 Node cluster > Reporter: Srikanth Kakani > Assignee: Devaraj Das > Priority: Blocker > Attachments: 1763.patch, 1763.try.patch > > > Steps to reproduce: > 1 .Run a map reduce application running more than 3000 mappers, each running > longer than > 2. Observe the lost task trackers. > Observations: > 1. Most of the lost taskTracker messages correspond to maps that have already > completed > 2. Based on the logs below the taskTracker is unable to connect to the job > tracker and so the jobTracker deletes the job after 20 minutes > One example: > task_200708210155_0003_m_000000_0 node1 KILLED 0.00% > 21-Aug-2007 09:39:09 Lost task tracker <-- Please note the time > Counters: > Map-Reduce Framework > Map input records 28,861 > Map output records 1,349,114 > Map input bytes 200,018,562 > Map output bytes 714,878,712 > Node 1 task tracker logs: > 2007-08-21 09:08:51,109 INFO org.apache.hadoop.mapred.TaskTracker: Task > task_200708210155_0003_m_000000_0 is done. <-- Please note the time > . > . > . > 2007-08-21 09:08:52,212 INFO org.mortbay.http.SocketListener: LOW ON THREADS > ((40-40+0)<1) on [EMAIL PROTECTED]:50060 > 2007-08-21 09:08:52,217 WARN org.mortbay.http.SocketListener: OUT OF THREADS: > [EMAIL PROTECTED]:50060 > . > . > . > 2007-08-21 09:18:53,877 ERROR org.apache.hadoop.mapred.TaskTracker: Caught > exception: java.net.SocketTimeoutException: timed out waiting for rpc response > at org.apache.hadoop.ipc.Client.call(Client.java:472) > at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:165) > at org.apache.hadoop.mapred.$Proxy0.heartbeat(Unknown Source) > at > org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:941) > at > org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:840) > at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1227) > at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1911) > . > . > . > 2007-08-21 09:47:45,207 INFO org.apache.hadoop.mapred.TaskTracker: Resending > 'status' to 'wm501219' with reponseId '5247 > 2007-08-21 09:47:46,023 INFO org.apache.hadoop.mapred.TaskTracker: Recieved > RenitTrackerAction from JobTracker > 2007-08-21 09:47:46,041 INFO org.apache.hadoop.mapred.TaskRunner: > task_200708210155_0003_m_000000_0 done; removing files. > 2007-08-21 09:47:46,240 INFO org.apache.hadoop.mapred.TaskRunner: > task_200708210155_0003_m_002237_0 done; removing files. > Tasktracker is pretty active otherwise: > tracker_wm511293.inktomisearch.com:50050 wm511293.inktomisearch.com > 1 6 3 > JobTracker logs: > 2007-08-21 09:01:11,951 INFO org.apache.hadoop.mapred.JobTracker: Adding task > 'task_200708210155_0003_m_000000_0' to tip tip_200708210155_0003_m_000000, > for tracker 'tracker_wm511293.inktomisearch.com:50050' > . > 2007-08-21 09:06:27,745 INFO org.apache.hadoop.mapred.JobTracker: Adding task > 'task_200708210155_0003_m_000000_1' to tip tip_200708210155_0003_m_000000, > for tracker 'tracker_wm511783.inktomisearch.com:50050' > . > 2007-08-21 09:08:51,212 INFO org.apache.hadoop.mapred.JobInProgress: Task > 'task_200708210155_0003_m_000000_0' has completed > tip_200708210155_0003_m_000000 successfully. > 2007-08-21 09:08:51,213 INFO org.apache.hadoop.mapred.TaskInProgress: Task > 'task_200708210155_0003_m_000000_0' has completed succesfully > . > 2007-08-21 09:11:27,227 INFO org.apache.hadoop.mapred.TaskInProgress: Already > complete TIP tip_200708210155_0003_m_000000 has completed task > task_200708210155_0003_m_000000_1 > . > 2007-08-21 09:39:09,014 INFO org.apache.hadoop.mapred.TaskInProgress: Error > from task_200708210155_0003_m_000000_0: Lost task tracker > 2007-08-21 09:39:09,014 INFO org.apache.hadoop.mapred.TaskInProgress: Task > 'task_200708210155_0003_m_000000_0' has been lost. > . > 2007-08-21 09:39:09,348 INFO org.apache.hadoop.mapred.JobTracker: Removed > completed task 'task_200708210155_0003_m_000000_0' from > 'tracker_wm511293.inktomisearch.com:50050' > . > 2007-08-21 09:47:20,855 INFO org.apache.hadoop.mapred.TaskInProgress: Error > from task_200708210155_0003_m_000000_1: Lost task tracker > 2007-08-21 09:47:20,855 WARN org.apache.hadoop.mapred.TaskInProgress: > Recieved duplicate status update of 'KILLED' for > 'task_200708210155_0003_m_000000_1' of TIP 'tip_200708210155_0003_m_000000' > Notes: > 1. I do not see the taskTracker dying during that period > 2. Is retry logic not accurate/agressive enough? (did something change > recently, this behavior is more evident in 0.15) > 3. Inconsistencies with jobTracker logs? Lost task tracker detection bad? > 4. TaskTracker: > CPU usage: 9:10-9:20 50% > 9:20-9:40 0% > Network Usage: 6M incl dfs operations > 5. JobTracker > CPU udage: Avg: 9% > Network Usage: Negligible -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.