Hi, I think I have run into a possible deadlock, though I am not sure whether it actually is one :-) Here is my scenario:
Run a job and call JobClient.monitorAndPrintJob to monitor it and get status updates. In parallel, invoke JobClient$NetworkedJob.killJob from another thread. For reference, here is the thread dump for both operations:

"MrPlanRunner" daemon prio=5 tid=7fe12cacf000 nid=0x11352f000 in Object.wait() [11352d000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <7f3c55668> (a org.apache.hadoop.ipc.Client$Call)
    at java.lang.Object.wait(Object.java:485)
    at org.apache.hadoop.ipc.Client.call(Client.java:1145)
    - locked <7f3c55668> (a org.apache.hadoop.ipc.Client$Call)
    at org.apache.hadoop.ipc.Client.call(Client.java:1122)
    at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:148)
    at $Proxy40.getApplicationReport(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ClientRMProtocolPBClientImpl.getApplicationReport(ClientRMProtocolPBClientImpl.java:116)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:343)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:143)
    at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:296)
    - locked <7f4d78950> (a org.apache.hadoop.mapred.ClientServiceDelegate)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:373)
    at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:483)
    at org.apache.hadoop.mapreduce.Job$1.run(Job.java:322)
    at org.apache.hadoop.mapreduce.Job$1.run(Job.java:319)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
    at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:319)
    - locked <7f4f70fc0> (a org.apache.hadoop.mapreduce.Job)
    at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:598)
    at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1280)
    at org.apache.hadoop.mapred.JobClient$NetworkedJob.monitorAndPrintJob(JobClient.java:432)
    at org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:902)
    at xxxxx.runJob(xxxxx.java:74)
    at xxxxx.doExecute(xxxxx.java:39)
    at xxxxx.doExecute(xxxxx.java:1)
    at xxxxexecute(xxxxxx.java:29)
    at xxxx.MrPlanRunner.run(xxxxx.java:117)
    at java.lang.Thread.run(Thread.java:680)

"Thread-2" prio=5 tid=7fe12e2de800 nid=0x114d15000 waiting for monitor entry [114d13000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:286)
    - waiting to lock <7f4d78950> (a org.apache.hadoop.mapred.ClientServiceDelegate)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:373)
    at org.apache.hadoop.mapred.YARNRunner.killJob(YARNRunner.java:509)
    at org.apache.hadoop.mapreduce.Job.killJob(Job.java:622)
    at org.apache.hadoop.mapred.JobClient$NetworkedJob.killJob(JobClient.java:319)
    - locked <7f4f8fa68> (a org.apache.hadoop.mapred.JobClient$NetworkedJob)
    at xxxx.cancelCurrentJob(xxxxxx.java:150)
    at xxxx.cancel(xxxxx.java:171)
    at xxxx.testCancelJob(xxxx.java:135)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
    at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:46)
    at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:46)
    at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:62)

In the thread dump we can see that the object <7f4d78950> (a ClientServiceDelegate) is locked by the MrPlanRunner thread (the thread calling JobClient.monitorAndPrintJob), which holds it while waiting on the IPC call for getApplicationReport. Thread-2 (the thread calling JobClient$NetworkedJob.killJob) tries to lock the same object and is BLOCKED.

Please let me know whether this is a possible problem in the code or whether my usage of the API is incorrect.

The build being used is 0.23.1-cdh4.0.0b2.

Cheers,
Subroto Sanyal
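
P.S. To make the pattern in the dump concrete, here is a minimal standalone sketch. It is not Hadoop code: the Delegate class, its methods, and the latches are hypothetical stand-ins for ClientServiceDelegate and the pending RPC reply. One thread enters a synchronized method and waits for a reply while still holding the monitor; a second thread calling another synchronized method on the same object stays BLOCKED until the reply arrives.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class MonitorContentionSketch {

    /** Hypothetical stand-in for ClientServiceDelegate: one monitor guards every call. */
    static class Delegate {
        private final CountDownLatch entered = new CountDownLatch(1);
        private final CountDownLatch reply = new CountDownLatch(1);

        // Holds the Delegate monitor while waiting for a simulated RPC reply.
        synchronized void getJobStatus() throws InterruptedException {
            entered.countDown();
            reply.await();
        }

        // Needs the same monitor, so it cannot start until getJobStatus() returns.
        synchronized void killJob() {
        }
    }

    /** Returns the state of the "kill" thread while the "monitor" thread holds the lock. */
    static Thread.State observeKillThreadState() throws InterruptedException {
        Delegate delegate = new Delegate();
        Thread monitor = new Thread(() -> {
            try {
                delegate.getJobStatus();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "monitor");
        Thread kill = new Thread(delegate::killJob, "kill");

        monitor.start();
        delegate.entered.await();   // the monitor thread now owns the Delegate lock
        kill.start();

        // Poll (bounded to 5 seconds) until the kill thread is stuck on the monitor.
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        while (kill.getState() != Thread.State.BLOCKED && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        Thread.State observed = kill.getState();

        delegate.reply.countDown(); // the simulated reply arrives; the lock is released
        monitor.join();
        kill.join();
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("kill thread while reply pending: " + observeKillThreadState());
    }
}
```

In this sketch the kill thread proceeds as soon as the reply arrives, so it shows lock contention rather than a true deadlock. Whether the same holds in my case depends on whether the getApplicationReport call ever returns, which is what I cannot tell from the dump alone.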