[jira] [Commented] (MAPREDUCE-4235) Killing app can lead to inconsistent app status between RM and HS

Jason Lowe (JIRA) Tue, 08 May 2012 15:50:15 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270908#comment-13270908
 ]


Jason Lowe commented on MAPREDUCE-4235:
---------------------------------------

The ApplicationMaster log will have this exception when it shuts down:

{noformat}
2012-05-08 16:19:34,666 ERROR [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Exception while unregistering 
RemoteTrace: 
 at LocalTrace: 
        org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: 
RemoteTrace: 
 at LocalTrace: 
        org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Application doesn't exist in cache appattempt_1336511902223_0001_000001
        at org.apache.hadoop.yarn.factories.impl.pb.YarnRemoteExceptionFactoryPBImpl.createYarnRemoteException(YarnRemoteExceptionFactoryPBImpl.java:39)
        at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:47)
        at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.finishApplicationMaster(ApplicationMasterService.java:222)
        at org.apache.hadoop.yarn.api.impl.pb.service.AMRMProtocolPBServiceImpl.finishApplicationMaster(AMRMProtocolPBServiceImpl.java:69)
        at org.apache.hadoop.yarn.proto.AMRMProtocol$AMRMProtocolService$2.callBlockingMethod(AMRMProtocol.java:85)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
        at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:124)
        at org.apache.hadoop.yarn.api.impl.pb.client.AMRMProtocolPBClientImpl.finishApplicationMaster(AMRMProtocolPBClientImpl.java:85)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.unregister(RMCommunicator.java:190)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.stop(RMCommunicator.java:216)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.stop(RMContainerAllocator.java:226)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.stop(MRAppMaster.java:668)
        at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
        at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$MRAppMasterShutdownHook.run(MRAppMaster.java:1036)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
{noformat}

I believe the following sequence of events leads to the problem:

# AM sees all tasks complete, changes internal job state to SUCCEEDED, triggers 
job finished event (which currently waits 5 seconds and enlarges the race 
window)
# kill client command connects to the AM, sees that the job state != RUNNING, 
then tells RM to kill application
# RM fields kill request, transitions app state from RUNNING to KILLED/KILLED, 
and unregisters app.  It leaves the tracking URL unchanged (it probably should 
null it out, as it does for AMs that exit unexpectedly)
# AM starts shutdown, tries to unregister with RM, and RM claims it doesn't 
know about the app (because it already unregistered it internally)
# HS reports app status as SUCCEEDED because jhist file shows job completed 
successfully.
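
To make the ordering concrete, here is a minimal sketch that replays the steps 
above against a toy model.  None of these classes are real Hadoop APIs: 
ToyResourceManager, the app id, and the status strings are hypothetical 
stand-ins, and the model assumes the RM simply drops a killed app from its 
cache.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the race; none of these classes are real Hadoop APIs. */
public class KillRaceSketch {

    /** Hypothetical stand-in for the RM's application cache. */
    static class ToyResourceManager {
        final Map<String, String> liveApps = new HashMap<>();    // appId -> state
        final Map<String, String> finalStatus = new HashMap<>(); // appId -> final status

        void register(String appId) {
            liveApps.put(appId, "RUNNING");
        }

        void kill(String appId) {
            // Step 3: record KILLED and drop the app from the cache.
            if (liveApps.remove(appId) != null) {
                finalStatus.put(appId, "KILLED");
            }
        }

        void unregister(String appId, String reportedStatus) {
            // Step 4: the app is already gone, so the request is rejected.
            if (!liveApps.containsKey(appId)) {
                throw new IllegalStateException(
                    "Application doesn't exist in cache " + appId);
            }
            liveApps.remove(appId);
            finalStatus.put(appId, reportedStatus);
        }
    }

    /** Replays steps 1-5; returns {RM status, HS status, unregister outcome}. */
    static String[] replay() {
        ToyResourceManager rm = new ToyResourceManager();
        String appId = "app_0001";                // hypothetical id
        rm.register(appId);

        String amJobState = "SUCCEEDED";          // step 1: AM sees all tasks complete
        String historyServerStatus = amJobState;  // step 5: jhist records success

        rm.kill(appId);                           // steps 2-3: client asks RM to kill

        String outcome = "accepted";
        try {
            rm.unregister(appId, amJobState);     // step 4: AM shuts down
        } catch (IllegalStateException e) {
            outcome = "rejected";
        }
        return new String[] { rm.finalStatus.get(appId), historyServerStatus, outcome };
    }

    public static void main(String[] args) {
        String[] r = replay();
        System.out.println("RM=" + r[0] + " HS=" + r[1] + " unregister=" + r[2]);
    }
}
```

In this model the RM ends up with KILLED while the history server ends up with 
SUCCEEDED, and the AM's unregister call is rejected, which matches the 
exception in the log above.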

If the RM fields an unregister request for an application that was killed, we 
may want to consider updating the application's status and tracking URL based 
on the unregister request, since it is likely to be more accurate (e.g., 
SUCCEEDED instead of KILLED, with the tracking URL pointing to the history 
server).
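
One possible shape for that suggestion, as a rough sketch only.  This is not 
the actual ResourceManager code: AppRecord, ToyResourceManager, the status 
strings, and both URLs are hypothetical, and the sketch assumes the RM keeps a 
record for a killed app so that a late unregister can still be matched to it.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the suggested behavior; none of these types are real Hadoop APIs. */
public class UnregisterAfterKillSketch {

    /** Hypothetical per-application record kept by the toy RM. */
    static class AppRecord {
        String finalStatus = "UNDEFINED";
        String trackingUrl;

        AppRecord(String trackingUrl) {
            this.trackingUrl = trackingUrl;
        }
    }

    static class ToyResourceManager {
        final Map<String, AppRecord> apps = new HashMap<>();

        void register(String appId, String amUrl) {
            apps.put(appId, new AppRecord(amUrl));
        }

        void kill(String appId) {
            AppRecord rec = apps.get(appId);
            if (rec != null) {
                // Assumption: keep the record instead of dropping it on kill.
                rec.finalStatus = "KILLED";
            }
        }

        void unregister(String appId, String reportedStatus, String historyUrl) {
            AppRecord rec = apps.get(appId);
            if (rec == null) {
                throw new IllegalStateException("Unknown application " + appId);
            }
            // Assumption: the AM's report is preferred, even after a kill.
            rec.finalStatus = reportedStatus;
            rec.trackingUrl = historyUrl;
        }
    }

    /** Kill first, then unregister; returns {final status, tracking URL}. */
    static String[] replay() {
        ToyResourceManager rm = new ToyResourceManager();
        rm.register("app_0001", "http://am.example:8888/");  // placeholder URL
        rm.kill("app_0001");
        rm.unregister("app_0001", "SUCCEEDED",
            "http://hs.example:19888/jobhistory/job_0001");  // placeholder URL
        AppRecord rec = rm.apps.get("app_0001");
        return new String[] { rec.finalStatus, rec.trackingUrl };
    }

    public static void main(String[] args) {
        String[] r = replay();
        System.out.println("status=" + r[0] + " trackingUrl=" + r[1]);
    }
}
```

Whether the RM should always trust the AM's report after a kill is a policy 
question that the sketch does not settle; it only shows the bookkeeping that 
would be needed.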
                
> Killing app can lead to inconsistent app status between RM and HS
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-4235
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4235
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3
>            Reporter: Jason Lowe
>
> If a client tries to kill an application that is about to complete, the 
> application states between the ResourceManager's web UI and the history 
> server can be inconsistent.  When the problem occurs, the ResourceManager 
> shows the Status/FinalStatus as KILLED/KILLED and the history link will 
> redirect to a broken link.  The history link still references the 
> ApplicationMaster which is now missing.  The history server entry will show 
> the application state as SUCCEEDED.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
