[
https://issues.apache.org/jira/browse/HADOOP-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Devaraj Das updated HADOOP-4595:
--------------------------------
Attachment: 4595.patch
This patch fixes a race condition in updating the free slot count when the load
is high (leading to lost TTs). When a TT reinits, the TaskLauncher object is
created again. A task that is still running at that point may, if it takes time
to exit, end up incrementing the free slots of the new TaskLauncher object. This
would lead to the behavior described by Aaron in the bug report. The patch fixes
this by moving all code that increments free slots into one method, which is now
called inline in TaskInProgress.kill.
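To illustrate the idea (this is not the code in 4595.patch), here is a minimal sketch of a single release method that is called inline from kill and becomes a no-op on any later call. The names Launcher, Task, releaseSlot and onExit are hypothetical stand-ins, not the actual TaskTracker classes or methods.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SlotReleaseSketch {

    /** Hypothetical stand-in for the object that tracks free slots. */
    static final class Launcher {
        private final int maxSlots;
        private int freeSlots;

        Launcher(int maxSlots) {
            this.maxSlots = maxSlots;
            this.freeSlots = maxSlots;
        }

        synchronized boolean takeSlot() {
            if (freeSlots == 0) {
                return false;
            }
            freeSlots--;
            return true;
        }

        synchronized void addFreeSlot() {
            // An over-count here is the kind of inconsistency the race produces.
            if (freeSlots == maxSlots) {
                throw new IllegalStateException("free slots would exceed max slots");
            }
            freeSlots++;
        }

        synchronized int freeSlots() {
            return freeSlots;
        }
    }

    /** Hypothetical stand-in for a task that holds one slot. */
    static final class Task {
        private final AtomicBoolean slotReleased = new AtomicBoolean(false);

        // The only place a slot is handed back; at most one call has an effect.
        private void releaseSlot(Launcher launcher) {
            if (slotReleased.compareAndSet(false, true)) {
                launcher.addFreeSlot();
            }
        }

        // Called synchronously while the old launcher is still current.
        void kill(Launcher current) {
            releaseSlot(current);
        }

        // Called whenever the task finally exits, possibly after a reinit.
        void onExit(Launcher current) {
            releaseSlot(current);
        }
    }

    public static void main(String[] args) {
        Launcher oldLauncher = new Launcher(2);
        oldLauncher.takeSlot();
        Task slowTask = new Task();

        slowTask.kill(oldLauncher);          // slot returned to the old launcher
        Launcher newLauncher = new Launcher(2); // reinit creates a fresh launcher
        slowTask.onExit(newLauncher);        // late exit: no second increment

        System.out.println(newLauncher.freeSlots()); // prints 2
    }
}
```

Without the once-only guard, the late onExit call would try to add a slot to a launcher that is already full.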
In addition, the patch fixes a race condition in starting the MapEventsFetcher
thread. The thread enters its loop only after checking the TaskTracker.running
flag. However, when a TT reinits, the running field is set to true only after
the thread is spawned. If the thread is scheduled immediately, it finds running
false and exits. This would lead to hung reduces.
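The ordering issue can be sketched as follows. This is a simplified illustration, not the TaskTracker code: Tracker, initialize, awaitLoopEntry and the latch are hypothetical names added for the example. The point is only that the flag is set before Thread.start(), so the new thread cannot observe it as false on its first check.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class FetcherStartSketch {

    /** Hypothetical stand-in for a tracker that owns a polling thread. */
    static final class Tracker {
        private volatile boolean running = false;
        // Lets callers observe that the loop body ran at least once.
        private final CountDownLatch enteredLoop = new CountDownLatch(1);
        private Thread fetcher;

        void initialize() {
            // Set the flag BEFORE starting the thread. Setting it after
            // start() would let a promptly scheduled thread see false and exit.
            running = true;
            fetcher = new Thread(() -> {
                while (running) {
                    enteredLoop.countDown();
                    try {
                        Thread.sleep(10); // placeholder for fetching events
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }, "events-fetcher-sketch");
            fetcher.setDaemon(true);
            fetcher.start();
        }

        boolean awaitLoopEntry(long timeoutMillis) {
            try {
                return enteredLoop.await(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }

        void shutdown() {
            running = false;
            fetcher.interrupt();
            try {
                fetcher.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        boolean fetcherAlive() {
            return fetcher.isAlive();
        }
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker();
        tracker.initialize();
        System.out.println("entered loop: " + tracker.awaitLoopEntry(5000));
        tracker.shutdown();
    }
}
```

Because a call to Thread.start() happens-before every action in the started thread, the write to running made before start() is guaranteed to be visible inside the loop.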
I also cleaned up some code to do with TIP.cleanup during a task launch.
> JVM Reuse triggers RuntimeException("Invalid state")
> ----------------------------------------------------
>
> Key: HADOOP-4595
> URL: https://issues.apache.org/jira/browse/HADOOP-4595
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.19.0
> Reporter: Aaron Kimball
> Assignee: Devaraj Das
> Fix For: 0.19.0
>
> Attachments: 4595.patch
>
>
> A Reducer triggers the following exception:
> 08/11/05 08:58:50 INFO mapred.JobClient: Task Id : attempt_200811040110_0230_r_000008_1, Status : FAILED
> java.lang.RuntimeException: Inconsistent state!!! JVM Manager reached an
> unstable state while reaping a JVM for task:
> attempt_200811040110_0230_r_000008_1 Number of active JVMs:2
> JVMId jvm_200811040110_0230_r_-735233075 #Tasks ran: 0 Currently busy? true
> Currently running: attempt_200811040110_0230_r_000012_0
> JVMId jvm_200811040110_0230_r_-1716942642 #Tasks ran: 0 Currently busy? true
> Currently running: attempt_200811040110_0230_r_000040_0
> at java.lang.Throwable.<init>(Throwable.java:67)
> at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.reapJvm(JvmManager.java:245)
> at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.access$000(JvmManager.java:113)
> at org.apache.hadoop.mapred.JvmManager.launchJvm(JvmManager.java:78)
> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:410)
> Other clues:
> In the three reduce task attempts where this was observed, the failing attempt
> was attempt _1. Attempt _0 had started and eventually switched to "SUCCEEDED",
> so I think this happens only on speculatively executed reduce task attempts.
> The reduce output (part-XXXXX) gets lost when this attempt fails, even though
> the other (earlier) attempt succeeded.