[ 
https://issues.apache.org/jira/browse/HADOOP-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj Das updated HADOOP-1763:
--------------------------------

    Attachment: 1763.try.patch

Attached is a straightforward patch that increases the number of rpc server 
handlers from 10 to 40. Pls try things out with this patch and let us know the 
results.

> Too many lost task trackers - Job failures
> ------------------------------------------
>
>                 Key: HADOOP-1763
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1763
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>         Environment: Version: 0.15.0-dev, r565628
> Compiled: Tue Aug 14 20:55:37 UTC 2007 by hadoopqa
> 1400 Node cluster
>            Reporter: Srikanth Kakani
>            Priority: Blocker
>         Attachments: 1763.try.patch
>
>
> Steps to reproduce:
> 1 .Run a map reduce application running more than 3000 mappers, each running 
> longer than
> 2. Observe the lost task trackers.
> Observations:
> 1. Most of the lost taskTracker messages correspond to maps that have already 
> completed
> 2. Based on the logs below the taskTracker is unable to connect to the job 
> tracker and so the jobTracker deletes the job after 20 minutes
> One example:
> task_200708210155_0003_m_000000_0     node1   KILLED  0.00%           
> 21-Aug-2007 09:39:09    Lost task tracker   <-- Please note the time
> Counters:
> Map-Reduce Framework
>       Map input records       28,861
>       Map output records      1,349,114
>       Map input bytes         200,018,562
>       Map output bytes        714,878,712
> Node 1 task tracker logs:
> 2007-08-21 09:08:51,109 INFO org.apache.hadoop.mapred.TaskTracker: Task 
> task_200708210155_0003_m_000000_0 is done. <-- Please note the time
> .
> .
> .
> 2007-08-21 09:08:52,212 INFO org.mortbay.http.SocketListener: LOW ON THREADS 
> ((40-40+0)<1) on [EMAIL PROTECTED]:50060
> 2007-08-21 09:08:52,217 WARN org.mortbay.http.SocketListener: OUT OF THREADS: 
> [EMAIL PROTECTED]:50060
> .
> .
> .
> 2007-08-21 09:18:53,877 ERROR org.apache.hadoop.mapred.TaskTracker: Caught 
> exception: java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:472)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:165)
>         at org.apache.hadoop.mapred.$Proxy0.heartbeat(Unknown Source)
>         at 
> org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:941)
>         at 
> org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:840)
>         at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1227)
>         at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1911)
> .
> .
> .
> 2007-08-21 09:47:45,207 INFO org.apache.hadoop.mapred.TaskTracker: Resending 
> 'status' to 'wm501219' with reponseId '5247
> 2007-08-21 09:47:46,023 INFO org.apache.hadoop.mapred.TaskTracker: Recieved 
> RenitTrackerAction from JobTracker
> 2007-08-21 09:47:46,041 INFO org.apache.hadoop.mapred.TaskRunner: 
> task_200708210155_0003_m_000000_0 done; removing files.
> 2007-08-21 09:47:46,240 INFO org.apache.hadoop.mapred.TaskRunner: 
> task_200708210155_0003_m_002237_0 done; removing files.
> Tasktracker is pretty active otherwise:
> tracker_wm511293.inktomisearch.com:50050      wm511293.inktomisearch.com      
> 1       6       3
> JobTracker logs:
> 2007-08-21 09:01:11,951 INFO org.apache.hadoop.mapred.JobTracker: Adding task 
> 'task_200708210155_0003_m_000000_0' to tip tip_200708210155_0003_m_000000, 
> for tracker 'tracker_wm511293.inktomisearch.com:50050'
> .
> 2007-08-21 09:06:27,745 INFO org.apache.hadoop.mapred.JobTracker: Adding task 
> 'task_200708210155_0003_m_000000_1' to tip tip_200708210155_0003_m_000000, 
> for tracker 'tracker_wm511783.inktomisearch.com:50050'
> .
> 2007-08-21 09:08:51,212 INFO org.apache.hadoop.mapred.JobInProgress: Task 
> 'task_200708210155_0003_m_000000_0' has completed 
> tip_200708210155_0003_m_000000 successfully.
> 2007-08-21 09:08:51,213 INFO org.apache.hadoop.mapred.TaskInProgress: Task 
> 'task_200708210155_0003_m_000000_0' has completed succesfully
> .
> 2007-08-21 09:11:27,227 INFO org.apache.hadoop.mapred.TaskInProgress: Already 
> complete TIP tip_200708210155_0003_m_000000 has completed task 
> task_200708210155_0003_m_000000_1
> .
> 2007-08-21 09:39:09,014 INFO org.apache.hadoop.mapred.TaskInProgress: Error 
> from task_200708210155_0003_m_000000_0: Lost task tracker
> 2007-08-21 09:39:09,014 INFO org.apache.hadoop.mapred.TaskInProgress: Task 
> 'task_200708210155_0003_m_000000_0' has been lost.
> .
> 2007-08-21 09:39:09,348 INFO org.apache.hadoop.mapred.JobTracker: Removed 
> completed task 'task_200708210155_0003_m_000000_0' from 
> 'tracker_wm511293.inktomisearch.com:50050'
> .
> 2007-08-21 09:47:20,855 INFO org.apache.hadoop.mapred.TaskInProgress: Error 
> from task_200708210155_0003_m_000000_1: Lost task tracker
> 2007-08-21 09:47:20,855 WARN org.apache.hadoop.mapred.TaskInProgress: 
> Recieved duplicate status update of 'KILLED' for 
> 'task_200708210155_0003_m_000000_1' of TIP 'tip_200708210155_0003_m_000000'
> Notes:
> 1. I do  not see the taskTracker dying during that period
> 2. Is retry logic not accurate/agressive enough? (did something change 
> recently, this behavior is more evident in 0.15)
> 3. Inconsistencies with jobTracker logs? Lost task tracker detection bad?
> 4. TaskTracker:
>           CPU usage: 9:10-9:20 50%
>                                   9:20-9:40 0%
>           Network Usage: 6M incl dfs operations
> 5. JobTracker
>           CPU udage: Avg: 9%
>           Network Usage:  Negligible

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to