Rohit Pegallapati created OOZIE-3291:
----------------------------------------
Summary: Oozie workflow hangs in running state even when the
underlying action failed
Key: OOZIE-3291
URL: https://issues.apache.org/jira/browse/OOZIE-3291
Project: Oozie
Issue Type: Bug
Components: workflow
Affects Versions: 4.1.0
Reporter: Rohit Pegallapati
We have multiple distcp actions in a fork/join. We use Hadoop 2.6.0 (CDH 5.5.1).
We are hitting
https://issues.apache.org/jira/browse/MAPREDUCE-6478
and when that happens the distcp action fails with the exception below.
{code:java}
2018-06-10 15:19:39,179 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1520068304865_972654_m_000000_0: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/xxx/oozie-oozi/1951586-180303074950833-oozie-oozi-W/distcp-to-dr-0-update-action--distcp/output/_temporary/1/_temporary/attempt_1520068304865_972654_m_000000_0/part-00000 (inode 192492374): File does not exist. Holder DFSClient_NONMAPREDUCE_-2068852542_1 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3604)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3690)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3660)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:738)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.complete(AuthorizationProviderProxyClientProtocol.java:243)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:528)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
{code}
We expect the WF to be killed at this point so that the subsequent WF can start.
Instead, the WF stays stuck in RUNNING state and other WFs pile up behind it in the
coordinator, leaving no option but to kill the running WF. Once the stuck WF is
killed, the other WFs process fine.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)