[
https://issues.apache.org/jira/browse/MAPREDUCE-805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741681#action_12741681
]
Amar Kamat commented on MAPREDUCE-805:
--------------------------------------
Note that I purposely added sleeps in JobTracker.initJob() and
JobInProgress.initTasks() to take care of race conditions. I didn't see any side
effects. With this patch, init always leaves the job in PREP state, but depending
on whether
- setup is required
- there are tasks to run
- a job-kill was issued during init
- job-init failed
the job then moves to RUNNING, SUCCEEDED, KILLED, or FAILED, or remains
in PREP. Here is how the state transition happens (note that after
job.initTasks() the job is in PREP state):
||setup needed?||maps=0 and reduces=0?||job killed during init?||init failed?||new state||comment||
|*|*|*|yes|FAILED|irrespective of the config, if the job fails in init, it is marked as FAILED|
|*|*|yes|no|KILLED|irrespective of the config, if the job is killed during init and init otherwise passed normally, the job is marked as KILLED|
|yes|*|no|no|PREP|if the job is configured to run setup, it remains in PREP state|
|no|yes|no|no|SUCCEEDED|if the job has no setup configured and there are no maps and no reduces, the job is marked SUCCEEDED|
|no|no|no|no|RUNNING|if the job has no setup configured and there are maps or reduces to run, the job is marked RUNNING|
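
The table above can be read as a priority-ordered decision. Here is a minimal sketch of that reading in Java. The class, enum, method, and parameter names are made up for illustration and are not taken from the patch; the real logic lives in JobTracker.initJob() and JobInProgress.initTasks() and may be structured differently.

```java
// Illustrative sketch only: encodes the state-transition table from this comment.
// All names here (InitStateSketch, State, stateAfterInit) are hypothetical.
public class InitStateSketch {

    public enum State { PREP, RUNNING, SUCCEEDED, KILLED, FAILED }

    /**
     * Returns the state the job ends up in once init has finished.
     *
     * @param setupNeeded      the job is configured to run a setup task
     * @param noMapsOrReduces  maps == 0 and reduces == 0
     * @param killedDuringInit a job-kill was issued while init was running
     * @param initFailed       init itself failed
     */
    public static State stateAfterInit(boolean setupNeeded,
                                       boolean noMapsOrReduces,
                                       boolean killedDuringInit,
                                       boolean initFailed) {
        // Row 1: an init failure wins over everything else.
        if (initFailed) {
            return State.FAILED;
        }
        // Row 2: init passed, but a kill arrived in the meantime.
        if (killedDuringInit) {
            return State.KILLED;
        }
        // Row 3: setup still has to run, so the job stays in PREP.
        if (setupNeeded) {
            return State.PREP;
        }
        // Rows 4 and 5: no setup; an empty job is done, otherwise it runs.
        return noMapsOrReduces ? State.SUCCEEDED : State.RUNNING;
    }

    public static void main(String[] args) {
        // Print the outcome for a job with no setup and no tasks.
        System.out.println(stateAfterInit(false, true, false, false));
    }
}
```

The order of the checks mirrors the wildcard (*) columns: the rows with more wildcards are tested first, so a failed init is reported as FAILED even if a kill was also issued.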
> Deadlock in Jobtracker
> ----------------------
>
> Key: MAPREDUCE-805
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-805
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Michael Tamm
> Attachments: MAPREDUCE-805-v1.1.patch,
> MAPREDUCE-805-v1.11-branch-0.20.patch, MAPREDUCE-805-v1.11.patch,
> MAPREDUCE-805-v1.2.patch, MAPREDUCE-805-v1.3.patch, MAPREDUCE-805-v1.6.patch,
> MAPREDUCE-805-v1.7.patch
>
>
> We are running a hadoop cluster (version 0.20.0) and have detected the
> following deadlock on our jobtracker:
> {code}
> "IPC Server handler 51 on 9001":
> at org.apache.hadoop.mapred.JobInProgress.getCounters(JobInProgress.java:943)
> - waiting to lock <0x00007f2b6fb46130> (a org.apache.hadoop.mapred.JobInProgress)
> at org.apache.hadoop.mapred.JobTracker.getJobCounters(JobTracker.java:3102)
> - locked <0x00007f2b5f026000> (a org.apache.hadoop.mapred.JobTracker)
> at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
> "pool-1-thread-2":
> at org.apache.hadoop.mapred.JobTracker.finalizeJob(JobTracker.java:2017)
> - waiting to lock <0x00007f2b5f026000> (a org.apache.hadoop.mapred.JobTracker)
> at org.apache.hadoop.mapred.JobInProgress.garbageCollect(JobInProgress.java:2483)
> - locked <0x00007f2b6fb46130> (a org.apache.hadoop.mapred.JobInProgress)
> at org.apache.hadoop.mapred.JobInProgress.terminateJob(JobInProgress.java:2152)
> - locked <0x00007f2b6fb46130> (a org.apache.hadoop.mapred.JobInProgress)
> at org.apache.hadoop.mapred.JobInProgress.terminate(JobInProgress.java:2169)
> - locked <0x00007f2b6fb46130> (a org.apache.hadoop.mapred.JobInProgress)
> at org.apache.hadoop.mapred.JobInProgress.fail(JobInProgress.java:2245)
> - locked <0x00007f2b6fb46130> (a org.apache.hadoop.mapred.JobInProgress)
> at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:86)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> {code}