[ 
https://issues.apache.org/jira/browse/MAPREDUCE-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728684#action_12728684
 ] 

Vinod K V commented on MAPREDUCE-733:
-------------------------------------

Just looked at the code causing this. The exception is thrown whenever a job 
tries to unreserve its tasks' slots on a TaskTracker whose reservation is held 
for a job other than this one. This case was supposed to be handled in 
MAPREDUCE-516 itself, but was unfortunately missed 
(https://issues.apache.org/jira/browse/MAPREDUCE-516?focusedCommentId=12721792&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12721792).
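
To illustrate the condition described above, here is a minimal sketch of a 
caller-side ownership check. It is not the actual Hadoop code or the committed 
fix: the class ReservationSketch, its nested Tracker type, the method 
unreserveIfOwner and the job ids are all hypothetical, and only the wording of 
the exception message is modelled on the quoted log.

```java
import java.util.Objects;

public class ReservationSketch {
    /** Hypothetical, simplified stand-in for a tracker's slot reservation. */
    static final class Tracker {
        private final String name;
        private String reservedForJob; // null means no reservation is held

        Tracker(String name) { this.name = name; }

        void reserveSlots(String jobId) { reservedForJob = jobId; }

        String reservedFor() { return reservedForJob; }

        /** Strict un-reserve: throws when the caller does not own the reservation. */
        void unreserveSlots(String jobId) {
            if (!Objects.equals(reservedForJob, jobId)) {
                throw new RuntimeException(name + " already has slots reserved for "
                    + reservedForJob + "; being asked to un-reserve for " + jobId);
            }
            reservedForJob = null;
        }
    }

    /** Assumed guard: un-reserve only if this job holds the reservation. */
    static boolean unreserveIfOwner(Tracker tracker, String jobId) {
        if (jobId.equals(tracker.reservedFor())) {
            tracker.unreserveSlots(jobId);
            return true;
        }
        return false; // reservation belongs to another job (or to none): do nothing
    }

    public static void main(String[] args) {
        Tracker tt = new Tracker("tracker_a");
        tt.reserveSlots("job_A");
        // job_B fails a task on this tracker but does not own the reservation.
        System.out.println(unreserveIfOwner(tt, "job_B"));
        // job_A owns the reservation, so it is released.
        System.out.println(unreserveIfOwner(tt, "job_A"));
    }
}
```

With a guard of this kind, a failing task from a job that holds no reservation 
on the tracker would skip the un-reserve call instead of raising the 
RuntimeException.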

The resulting behavior is that when a task fails, one heartbeat of the TT is 
lost, but the next heartbeat passes through. This is because the first 
heartbeat marks the task as FAILED on the JobTracker, so the faulty code 
isn't invoked for the same TT again in later heartbeats. This leaves 
inconsistent state on the JT. For example, the code immediately following the 
faulty call creates the task completion event, which is therefore never 
created for this task. This issue HAS to be fixed immediately because of these 
side effects.
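
The sequence described above can be modelled with a small sketch. It is a toy 
model, not JobTracker code: the class HeartbeatSketch, its fields and the task 
id are hypothetical, and the ordering of the three steps is taken only from 
the description in this comment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HeartbeatSketch {
    enum State { RUNNING, FAILED }

    final Map<String, State> taskStates = new HashMap<>();
    final List<String> completionEvents = new ArrayList<>();

    /** Handles one reported task failure in the order the comment describes. */
    void processFailure(String taskId) {
        if (taskStates.get(taskId) == State.FAILED) {
            return; // already FAILED: the faulty path is not entered again
        }
        taskStates.put(taskId, State.FAILED); // step 1: state is updated
        // step 2: stands in for the faulty un-reserve call, which throws
        if (true) {
            throw new RuntimeException("being asked to un-reserve for another job");
        }
        completionEvents.add(taskId + ":FAILED"); // step 3: never reached
    }

    /** Returns true if the heartbeat was processed without an exception. */
    boolean heartbeat(String taskId) {
        try {
            processFailure(taskId);
            return true;
        } catch (RuntimeException e) {
            return false; // this heartbeat is lost
        }
    }

    public static void main(String[] args) {
        HeartbeatSketch jt = new HeartbeatSketch();
        jt.taskStates.put("task_1", State.RUNNING);
        System.out.println(jt.heartbeat("task_1")); // first heartbeat
        System.out.println(jt.heartbeat("task_1")); // resent heartbeat
        System.out.println(jt.completionEvents);
    }
}
```

In this model the first heartbeat fails after the state change, the resent 
heartbeat succeeds because the task is already FAILED, and the completion 
event list stays empty, which is the inconsistency noted above.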

One more thing I've observed while going through this is that reservations are 
not removed from a TaskTracker that is globally blacklisted, whether because 
of a large task-failure count or because of unhealthy status.

> When running ant test TestTrackerBlacklistAcrossJobs, losing task tracker 
> heartbeat exception occurs. 
> ------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-733
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-733
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>            Reporter: Iyappan Srinivasan
>
> When running ant test TestTrackerBlacklistAcrossJobs, losing task tracker 
> heartbeat. 
> It seems that when a task tracker is killed, it throws an exception. Instead, 
> it should catch and process the exception and allow the rest of the flow to 
> go through.
> 2009-07-08 11:58:26,116 INFO  ipc.Server (Server.java:run(973)) - IPC Server 
> handler 7 on 40193, call 
> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@13ec758, false, false, 
> true, 6) from 127.0.0.1:40200: error: java.io.IOException: 
> java.lang.RuntimeException: tracker_host1.rack.com:localhost/127.0.0.1:40197 
> already has slots reserved for null; being asked to un-reserve for 
> job_200907081158_0001
> java.io.IOException: java.lang.RuntimeException: 
> tracker_host1.rack.com:localhost/127.0.0.1:40197 already has slots reserved 
> for null; being asked to un-reserve for job_200907081158_0001
>         at 
> org.apache.hadoop.mapreduce.server.jobtracker.TaskTracker.unreserveSlots(TaskTracker.java:162)
>         at 
> org.apache.hadoop.mapred.JobInProgress.addTrackerTaskFailure(JobInProgress.java:1580)
>         at 
> org.apache.hadoop.mapred.JobInProgress.failedTask(JobInProgress.java:2908)
>         at 
> org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:1025)
>         at 
> org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:3869)
>         at 
> org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:3081)
>         at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2819)
>         at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:960)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:958)
> 2009-07-08 11:58:26,162 INFO  mapred.TaskTracker 
> (TaskTracker.java:transmitHeartBeat(1196)) - Resending 'status' to 
> 'localhost' with reponseId '6

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
