We have been seeing jobs that do not stop after we execute 'yarn application --kill <id>'. The YARN master appears to go through the process of killing the job, and the job is no longer listed by 'yarn application --list'. However, the job continues to process messages, and we have to chase down the rogue process and kill it ourselves. Once we are in this state, we also have problems with the scripts in hadoop/sbin: the stop-yarn script finds no resource managers, so we have to manually 'ps' and kill the resource manager processes.
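For illustration, the manual 'ps' hunt described above could be scripted roughly as follows. This is only a sketch, not a fix for the underlying problem. It assumes that the leftover container JVMs carry the YARN application ID somewhere in their command line (for example in a log or working directory path), which should be verified on the cluster first. The helper names `find_pids` and `list_processes` are made up for this example.

```python
import os
import subprocess
import sys
from typing import List, Optional


def find_pids(ps_output: str, app_id: str, exclude: Optional[int] = None) -> List[int]:
    """Return PIDs of processes whose command line mentions app_id.

    ps_output is expected in `ps -eo pid,args` format: one process per
    line, PID first, then the full command line.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip the header row and any malformed lines
        pid, command = int(parts[0]), parts[1]
        if pid == exclude:
            continue  # do not report this script itself
        if app_id in command:
            pids.append(pid)
    return pids


def list_processes() -> str:
    """Capture PID and full command line for every process on this host."""
    result = subprocess.run(
        ["ps", "-eo", "pid,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__" and len(sys.argv) == 2:
    # Only prints candidate PIDs; killing them is left to the operator.
    for pid in find_pids(list_processes(), sys.argv[1], exclude=os.getpid()):
        print(pid)
```

Run on each node manager host with the application ID as the only argument. For the attempt in the trace below that would presumably be application_1481305736678_0002, going by YARN's usual naming of attempt IDs. The script deliberately stops at listing PIDs so that each one can be inspected before it is killed.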
We have seen this problem several times now in production. It causes us to process messages when we think processing has stopped. Has anyone seen this, or does anyone have any insight?

Yarn: 2.6
Samza: 0.10.1

We do see the stack trace below in the Application Master log after killing the job in question:

2017-01-23 18:30:26 SamzaAppMaster$ [WARN] Listener org.apache.samza.job.yarn.SamzaAppMasterLifecycle@9fc9f91 failed to shutdown.
org.apache.hadoop.security.token.SecretManager$InvalidToken: appattempt_1481305736678_0002_000001 not found in AMRMTokenSecretManager.
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:94)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy16.finishApplicationMaster(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.unregisterApplicationMaster(AMRMClientImpl.java:378)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl.unregisterApplicationMaster(AMRMClientAsyncImpl.java:157)
    at org.apache.samza.job.yarn.SamzaAppMasterLifecycle.onShutdown(SamzaAppMasterLifecycle.scala:63)
    at org.apache.samza.job.yarn.SamzaAppMaster$$anonfun$run$5.apply(SamzaAppMaster.scala:145)
    at org.apache.samza.job.yarn.SamzaAppMaster$$anonfun$run$5.apply(SamzaAppMaster.scala:144)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.samza.job.yarn.SamzaAppMaster$.run(SamzaAppMaster.scala:144)
    at org.apache.samza.job.yarn.SamzaAppMaster$.main(SamzaAppMaster.scala:112)
    at org.apache.samza.job.yarn.SamzaAppMaster.main(SamzaAppMaster.scala)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): appattempt_1481305736678_0002_000001 not found in AMRMTokenSecretManager.
    at org.apache.hadoop.ipc.Client.call(Client.java:1469)
    at org.apache.hadoop.ipc.Client.call(Client.java:1400)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy15.finishApplicationMaster(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:91)
    ... 16 more

Jeremiah Adams
Software Engineer
www.helixeducation.com