[
https://issues.apache.org/jira/browse/MAPREDUCE-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270908#comment-13270908
]
Jason Lowe commented on MAPREDUCE-4235:
---------------------------------------
The ApplicationMaster log will have this exception when it shuts down:
{noformat}
2012-05-08 16:19:34,666 ERROR [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Exception while unregistering
RemoteTrace:
at LocalTrace:
org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl:
RemoteTrace:
at LocalTrace:
org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Application doesn't exist in cache appattempt_1336511902223_0001_000001
    at org.apache.hadoop.yarn.factories.impl.pb.YarnRemoteExceptionFactoryPBImpl.createYarnRemoteException(YarnRemoteExceptionFactoryPBImpl.java:39)
    at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:47)
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.finishApplicationMaster(ApplicationMasterService.java:222)
    at org.apache.hadoop.yarn.api.impl.pb.service.AMRMProtocolPBServiceImpl.finishApplicationMaster(AMRMProtocolPBServiceImpl.java:69)
    at org.apache.hadoop.yarn.proto.AMRMProtocol$AMRMProtocolService$2.callBlockingMethod(AMRMProtocol.java:85)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
    at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:124)
    at org.apache.hadoop.yarn.api.impl.pb.client.AMRMProtocolPBClientImpl.finishApplicationMaster(AMRMProtocolPBClientImpl.java:85)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.unregister(RMCommunicator.java:190)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.stop(RMCommunicator.java:216)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.stop(RMContainerAllocator.java:226)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.stop(MRAppMaster.java:668)
    at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
    at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$MRAppMasterShutdownHook.run(MRAppMaster.java:1036)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
{noformat}
I believe the following sequence of events leads to the problem:
# The AM sees all tasks complete, changes its internal job state to SUCCEEDED,
and triggers the job finished event (which currently waits 5 seconds,
enlarging the race window).
# The kill client command connects to the AM, sees that the job state !=
RUNNING, and then tells the RM to kill the application.
# The RM fields the kill request, transitions the app state from RUNNING to
KILLED/KILLED, and unregisters the app. It leaves the tracking URL unchanged
(it probably should null it out, as it does for AMs that exit unexpectedly).
# The AM starts shutting down and tries to unregister with the RM, and the RM
claims it doesn't know about the app (because it already unregistered it
internally).
# The HS reports the app status as SUCCEEDED because the jhist file shows the
job completed successfully.
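The sequence above can be replayed deterministically with a toy model. This is only a sketch: KillRaceReplay, ToyRM, and their methods are hypothetical stand-ins, not the real ResourceManager or MRAppMaster classes, and the history server is reduced to a single variable holding what the jhist file would record.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the race; none of these classes are real YARN or MapReduce types.
class KillRaceReplay {
    enum State { RUNNING, SUCCEEDED, KILLED }

    // Stand-in for the RM: tracks live attempts in a cache and remembers final states.
    static class ToyRM {
        final Map<String, State> liveAttempts = new HashMap<>();
        final Map<String, State> finalStates = new HashMap<>();

        void register(String attemptId) {
            liveAttempts.put(attemptId, State.RUNNING);
        }

        // Step 3: a kill records KILLED and drops the attempt from the cache.
        void kill(String attemptId) {
            if (liveAttempts.remove(attemptId) != null) {
                finalStates.put(attemptId, State.KILLED);
            }
        }

        // Step 4: unregistering an attempt that is no longer cached fails.
        void finishApplicationMaster(String attemptId, State reported) {
            if (liveAttempts.remove(attemptId) == null) {
                throw new IllegalStateException(
                    "Application doesn't exist in cache " + attemptId);
            }
            finalStates.put(attemptId, reported);
        }
    }

    // Replays steps 1-5 in order and returns {RM final state, history state}.
    static State[] replay() {
        String attemptId = "appattempt_1336511902223_0001_000001";
        ToyRM rm = new ToyRM();
        rm.register(attemptId);

        // Step 1: the AM sees all tasks complete; the history file records SUCCEEDED.
        State amJobState = State.SUCCEEDED;
        State historyState = amJobState;

        // Step 2: the kill client sees job state != RUNNING and asks the RM to kill.
        if (amJobState != State.RUNNING) {
            rm.kill(attemptId);
        }

        // Step 4: the AM's late unregister is rejected by the RM.
        try {
            rm.finishApplicationMaster(attemptId, amJobState);
        } catch (IllegalStateException e) {
            System.out.println("AM unregister failed: " + e.getMessage());
        }

        // Step 5: the RM and the history server now disagree.
        return new State[] { rm.finalStates.get(attemptId), historyState };
    }

    public static void main(String[] args) {
        State[] result = replay();
        System.out.println("RM=" + result[0] + " HS=" + result[1]);
    }
}
```

In this model the RM ends up with KILLED while the history side holds SUCCEEDED, which mirrors the inconsistency described in the issue.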
If the RM fields an unregister request for an application that was killed, we
may want to consider updating the application's status and tracking URL based
on the unregister request, since it is likely to be more accurate (e.g.,
SUCCEEDED instead of KILLED, with the tracking URL pointing to the history
server).
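One possible shape for that handling is sketched below. It assumes the RM keeps a record for a killed application instead of dropping it, so a late unregister can still find it. The class LenientUnregisterSketch, its methods, and the example IDs and URLs are all hypothetical and do not reflect the actual ResourceManager code or any committed fix.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the suggestion; not the actual ResourceManager code.
class LenientUnregisterSketch {
    enum State { RUNNING, SUCCEEDED, FAILED, KILLED }

    static class AppRecord {
        State finalStatus = State.RUNNING;
        String trackingUrl;
        boolean unregistered; // true once the AM has reported its own final status

        AppRecord(String trackingUrl) {
            this.trackingUrl = trackingUrl;
        }
    }

    private final Map<String, AppRecord> apps = new HashMap<>();

    void register(String appId, String amUrl) {
        apps.put(appId, new AppRecord(amUrl));
    }

    // A kill marks the app KILLED but keeps the record (assumption) so that a
    // late unregister from the AM can still be matched to it.
    void kill(String appId) {
        AppRecord app = apps.get(appId);
        if (app != null && !app.unregistered) {
            app.finalStatus = State.KILLED;
        }
    }

    // An unregister from the AM overrides an earlier KILLED status, since the
    // AM's report is likely to be more accurate, and repoints the tracking URL.
    void finishApplicationMaster(String appId, State reported, String historyUrl) {
        AppRecord app = apps.get(appId);
        if (app == null) {
            throw new IllegalStateException("Unknown application " + appId);
        }
        app.finalStatus = reported;
        app.trackingUrl = historyUrl;
        app.unregistered = true;
    }

    State finalStatus(String appId) {
        return apps.get(appId).finalStatus;
    }

    String trackingUrl(String appId) {
        return apps.get(appId).trackingUrl;
    }

    public static void main(String[] args) {
        LenientUnregisterSketch rm = new LenientUnregisterSketch();
        // IDs and URLs below are made up for illustration.
        rm.register("app_1", "http://am.example/app_1");
        rm.kill("app_1");
        rm.finishApplicationMaster("app_1", State.SUCCEEDED, "http://hs.example/job_1");
        System.out.println(rm.finalStatus("app_1") + " " + rm.trackingUrl("app_1"));
    }
}
```

A real change would also need to decide how long a killed application's record stays eligible for such an update, and whether an unregister from an AM that really was killed mid-job should be trusted; the sketch does not address either question.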
> Killing app can lead to inconsistent app status between RM and HS
> -----------------------------------------------------------------
>
> Key: MAPREDUCE-4235
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4235
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 0.23.3
> Reporter: Jason Lowe
>
> If a client tries to kill an application that is about to complete, the
> application states between the ResourceManager's web UI and the history
> server can be inconsistent. When the problem occurs, the ResourceManager
> shows the Status/FinalStatus as KILLED/KILLED and the history link will
> redirect to a broken link. The history link still references the
> ApplicationMaster which is now missing. The history server entry will show
> the application state as SUCCEEDED.