Too man lost task trackers - Job failures
-----------------------------------------

                 Key: HADOOP-1763
                 URL: https://issues.apache.org/jira/browse/HADOOP-1763
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
         Environment: Version: 0.15.0-dev, r565628
Compiled: Tue Aug 14 20:55:37 UTC 2007 by hadoopqa
1400 Node cluster
            Reporter: Srikanth Kakani
            Priority: Blocker


Steps to reproduce:
1 .Run a map reduce application running more than 3000 mappers, each running 
longer than
2. Observe the lost task trackers.

Observations:
1. Most of the lost taskTracker messages correspond to maps that have already 
completed
2. Based on the logs below the taskTracker is unable to connect to the job 
tracker and so the jobTracker deletes the job after 20 minutes

One example:
task_200708210155_0003_m_000000_0       node1   KILLED  0.00%           
21-Aug-2007 09:39:09    Lost task tracker   <-- Please note the time

Counters:
Map-Reduce Framework
        Map input records       28,861
        Map output records      1,349,114
        Map input bytes         200,018,562
        Map output bytes        714,878,712

Node 1 task tracker logs:
2007-08-21 09:08:51,109 INFO org.apache.hadoop.mapred.TaskTracker: Task 
task_200708210155_0003_m_000000_0 is done. <-- Please note the time
.
.
.
2007-08-21 09:08:52,212 INFO org.mortbay.http.SocketListener: LOW ON THREADS 
((40-40+0)<1) on [EMAIL PROTECTED]:50060
2007-08-21 09:08:52,217 WARN org.mortbay.http.SocketListener: OUT OF THREADS: 
[EMAIL PROTECTED]:50060
.
.
.
2007-08-21 09:18:53,877 ERROR org.apache.hadoop.mapred.TaskTracker: Caught 
exception: java.net.SocketTimeoutException: timed out waiting for rpc response
        at org.apache.hadoop.ipc.Client.call(Client.java:472)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:165)
        at org.apache.hadoop.mapred.$Proxy0.heartbeat(Unknown Source)
        at 
org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:941)
        at 
org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:840)
        at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1227)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1911)
.
.
.
2007-08-21 09:47:45,207 INFO org.apache.hadoop.mapred.TaskTracker: Resending 
'status' to 'wm501219' with reponseId '5247
2007-08-21 09:47:46,023 INFO org.apache.hadoop.mapred.TaskTracker: Recieved 
RenitTrackerAction from JobTracker
2007-08-21 09:47:46,041 INFO org.apache.hadoop.mapred.TaskRunner: 
task_200708210155_0003_m_000000_0 done; removing files.
2007-08-21 09:47:46,240 INFO org.apache.hadoop.mapred.TaskRunner: 
task_200708210155_0003_m_002237_0 done; removing files.

Tasktracker is pretty active otherwise:
tracker_wm511293.inktomisearch.com:50050        wm511293.inktomisearch.com      
1       6       3

JobTracker logs:
2007-08-21 09:01:11,951 INFO org.apache.hadoop.mapred.JobTracker: Adding task 
'task_200708210155_0003_m_000000_0' to tip tip_200708210155_0003_m_000000, for 
tracker 'tracker_wm511293.inktomisearch.com:50050'
.
2007-08-21 09:06:27,745 INFO org.apache.hadoop.mapred.JobTracker: Adding task 
'task_200708210155_0003_m_000000_1' to tip tip_200708210155_0003_m_000000, for 
tracker 'tracker_wm511783.inktomisearch.com:50050'
.
2007-08-21 09:08:51,212 INFO org.apache.hadoop.mapred.JobInProgress: Task 
'task_200708210155_0003_m_000000_0' has completed 
tip_200708210155_0003_m_000000 successfully.
2007-08-21 09:08:51,213 INFO org.apache.hadoop.mapred.TaskInProgress: Task 
'task_200708210155_0003_m_000000_0' has completed succesfully
.
2007-08-21 09:11:27,227 INFO org.apache.hadoop.mapred.TaskInProgress: Already 
complete TIP tip_200708210155_0003_m_000000 has completed task 
task_200708210155_0003_m_000000_1
.
2007-08-21 09:39:09,014 INFO org.apache.hadoop.mapred.TaskInProgress: Error 
from task_200708210155_0003_m_000000_0: Lost task tracker
2007-08-21 09:39:09,014 INFO org.apache.hadoop.mapred.TaskInProgress: Task 
'task_200708210155_0003_m_000000_0' has been lost.
.
2007-08-21 09:39:09,348 INFO org.apache.hadoop.mapred.JobTracker: Removed 
completed task 'task_200708210155_0003_m_000000_0' from 
'tracker_wm511293.inktomisearch.com:50050'
.
2007-08-21 09:47:20,855 INFO org.apache.hadoop.mapred.TaskInProgress: Error 
from task_200708210155_0003_m_000000_1: Lost task tracker
2007-08-21 09:47:20,855 WARN org.apache.hadoop.mapred.TaskInProgress: Recieved 
duplicate status update of 'KILLED' for 'task_200708210155_0003_m_000000_1' of 
TIP 'tip_200708210155_0003_m_000000'



Notes:
1. I do  not see the taskTracker dying during that period
2. Is retry logic not accurate/agressive enough? (did something change 
recently, this behavior is more evident in 0.15)
3. Inconsistencies with jobTracker logs? Lost task tracker detection bad?
4. TaskTracker:
          CPU usage: 9:10-9:20 50%
                                  9:20-9:40 0%
          Network Usage: 6M incl dfs operations
5. JobTracker
          CPU udage: Avg: 9%
          Network Usage:  Negligible



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to