Rohit Pegallapati created OOZIE-3291:
----------------------------------------

             Summary: Oozie workflow hangs in running state even when the 
underlying action failed
                 Key: OOZIE-3291
                 URL: https://issues.apache.org/jira/browse/OOZIE-3291
             Project: Oozie
          Issue Type: Bug
          Components: workflow
    Affects Versions: 4.1.0
            Reporter: Rohit Pegallapati


We have multiple distcp actions in a fork/join. We use Hadoop 2.6.0 (CDH 5.5.1). 
We are hitting

https://issues.apache.org/jira/browse/MAPREDUCE-6478

When this happens, the distcp action fails with the exception below.
{code:java}
2018-06-10 15:19:39,179 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report 
from attempt_1520068304865_972654_m_000000_0: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on 
/user/xxx/oozie-oozi/1951586-180303074950833-oozie-oozi-W/distcp-to-dr-0-update-action--distcp/output/_temporary/1/_temporary/attempt_1520068304865_972654_m_000000_0/part-00000
 (inode 192492374): File does not exist. Holder 
DFSClient_NONMAPREDUCE_-2068852542_1 does not have any open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3604)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3690)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3660)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:738)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.complete(AuthorizationProviderProxyClientProtocol.java:243)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:528)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

{code}

When this happens, we expect the WF to be killed and the subsequent WF to start. 
Instead, this WF stays stuck in the RUNNING state and other WFs pile up behind it in the 
coordinator, leaving no option but to kill the running WF. After the defective 
WF is killed, the other WFs process perfectly fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
