[
https://issues.apache.org/jira/browse/HADOOP-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499174
]
Arun C Murthy commented on HADOOP-1374:
---------------------------------------
Konstantin, as per your attached logs, one of your task-trackers was 'lost' (it
takes 10mins to declare a tracker 'lost'); its tasks were rescheduled to the
other tracker and your job completed fine (as per the jobtracker logs)...
Ok, I've racked my brains on this one and let me try and explain what I think
is happening and potentially one short-term fix to ease our lives... fasten
your seat-belts please:
a) MapTask completes and we see the 'done' message from
{{TaskTracker.reportDone}}.
b) However {{TaskTracker.reportDone}} only notes that the task is *done* by
setting a boolean (but *does not* mark the {{TaskInProgress.runstate}} as
{{SUCCEEDED}}).
c) The child jvm, for whatever reason (maybe a windows peculiarity), doesn't
'exit' (might be due to stray non-daemon threads etc.). Thus
{{TaskRunner.runChild}}'s {{process.waitFor}} hangs, and hence
{{TaskRunner.run}} cannot call {{TaskTracker.reportTaskFinished}}, which is
where {{TaskInProgress.runstate}} gets set to {{SUCCEEDED}}.
d) *10 mins* later {{TaskTracker.markUnresponsiveTasks}} marks this task as
'unresponsive' and kills it. However this might be too late, since the junit
test case is killed for (possibly) over-running its 15min limit and we have a
failed test case.
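To make the sequence in (a)-(d) concrete, here is a minimal sketch of the state
transitions as I understand them. It is a toy model, not the actual Hadoop
code: the class {{TaskStateSketch}}, its fields and its method signatures are
all hypothetical, and only the method names echo the real ones mentioned above.

```java
/**
 * Toy model of steps (a)-(d). All names here are hypothetical stand-ins,
 * not Hadoop's real TaskTracker/TaskInProgress classes.
 */
public class TaskStateSketch {
    public enum RunState { RUNNING, SUCCEEDED, KILLED }

    private boolean done = false;                  // set in step (b)
    private RunState runState = RunState.RUNNING;  // only changed in (c) or (d)
    private long lastReportMillis;

    public TaskStateSketch(long startMillis) {
        this.lastReportMillis = startMillis;
    }

    /** Step (b): the task says it is done; only a boolean is recorded. */
    public void reportDone(long nowMillis) {
        done = true;
        lastReportMillis = nowMillis;
    }

    /** Step (c): reached only once the child process has actually exited. */
    public void reportTaskFinished() {
        if (done && runState == RunState.RUNNING) {
            runState = RunState.SUCCEEDED;
        }
    }

    /** Step (d): a task still RUNNING past the timeout is killed. */
    public void markUnresponsive(long nowMillis, long timeoutMillis) {
        if (runState == RunState.RUNNING
                && nowMillis - lastReportMillis > timeoutMillis) {
            runState = RunState.KILLED;
        }
    }

    public RunState getRunState() {
        return runState;
    }

    public static void main(String[] args) {
        long tenMinutes = 10L * 60 * 1000;
        TaskStateSketch task = new TaskStateSketch(0L);
        task.reportDone(1000L);
        // The child jvm never exits, so reportTaskFinished() is never called.
        task.markUnresponsive(1000L + tenMinutes + 1, tenMinutes);
        System.out.println("state after timeout: " + task.getRunState());
    }
}
```

The point of the sketch is that a task which is 'done' but whose jvm never
exits stays in the running state until the unresponsive-task check fires.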
Phew! Hope that makes sense, it looks like we might have to figure out why the
child-jvm isn't exiting in the first place. So far other than that there isn't
a bug IMO.
One option is to reduce those timeouts from 10mins to 3-5mins for the
test-cases and things should swim along fine for now, while we continue to try
and figure out this one for 0.14.0 or 0.13.1 if possible... does that sound
reasonable? Nigel?
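A rough sketch of the arithmetic behind that suggestion, using only the numbers
above (15min junit limit, 10min timeout today, 3-5mins proposed). The class and
method names are made up for illustration, and the calculation ignores however
long the rescheduled task takes to re-run after the kill, so treat the result
as an upper bound on the slack.

```java
import java.time.Duration;

/** Hypothetical helper: how much of the junit limit is left after one timeout. */
public class TimeoutBudgetSketch {

    /** Time remaining in the test once a hung task has been timed out and killed. */
    public static Duration slack(Duration testLimit, Duration taskTimeout) {
        return testLimit.minus(taskTimeout);
    }

    public static void main(String[] args) {
        Duration junitLimit = Duration.ofMinutes(15);
        // 10 is the current timeout; 5 and 3 are the proposed values.
        for (long minutes : new long[] {10, 5, 3}) {
            Duration left = slack(junitLimit, Duration.ofMinutes(minutes));
            System.out.println(minutes + "min timeout leaves at most "
                    + left.toMinutes() + "min for everything else");
        }
    }
}
```

With the 10min timeout a single hung task eats two thirds of the test's budget,
which is why shortening it for the test-cases should help.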
> TaskTracker falls into an infinite loop.
> ----------------------------------------
>
> Key: HADOOP-1374
> URL: https://issues.apache.org/jira/browse/HADOOP-1374
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.12.3
> Reporter: Konstantin Shvachko
> Assigned To: Arun C Murthy
> Priority: Blocker
> Fix For: 0.13.0
>
> Attachments: DataNode1.log, DataNode2.log, JobTracker.log,
> NameNode.log, TaskTracker1.log, TaskTracker2.log, TestDFSIO.log
>
>
> All maps had been completed successfully. I had only one reduce task, during
> which TaskTracker infinitely outputs:
> 07/05/15 19:35:41 INFO mapred.TaskTracker: task_0001_r_000000_0 0.16666667%
> reduce > copy (4 of 8 at 0.00 MB/s) >
> 07/05/15 19:35:42 INFO mapred.TaskTracker: task_0001_r_000000_0 0.16666667%
> reduce > copy (4 of 8 at 0.00 MB/s) >
> 07/05/15 19:35:43 INFO mapred.TaskTracker: task_0001_r_000000_0 0.16666667%
> reduce > copy (4 of 8 at 0.00 MB/s) >
> 07/05/15 19:35:44 INFO mapred.TaskTracker: task_0001_r_000000_0 0.16666667%
> reduce > copy (4 of 8 at 0.00 MB/s) >
> 07/05/15 19:35:45 INFO mapred.TaskTracker: task_0001_r_000000_0 0.16666667%
> reduce > copy (4 of 8 at 0.00 MB/s) >
> JobTracker does not log anything about task task_0001_r_000000_0 except for
> 07/05/15 19:49:01 INFO mapred.JobTracker: Adding task 'task_0001_r_000000_0'
> to tip tip_0001_r_000000, for tracker 'tracker_my-host.com:50050'