[
https://issues.apache.org/jira/browse/OOZIE-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647181#comment-16647181
]
Satish Subhashrao Saley commented on OOZIE-3366:
------------------------------------------------
I correlated the logs with the relevant code. It seems we are not suspending
the parent workflow when a subworkflow gets suspended.
Logs:
{code}
2018-04-23 02:15:25,620 WARN ActionStartXCommand:523 [pool-12-thread-224] -
SERVER[wf322] USER[saley] GROUP[users] TOKEN[] APP[saleyapp]
JOB[123-123-oozie-saley--W] ACTION[123-123-oozie-saley--W@saleyapp] Error
starting action [saleyapp]. ErrorType [NON_TRANSIENT], ErrorCode [JA002],
Message [JA002: User: wrkflow is not allowed to impersonate saley]
2018-04-23 02:15:25,620 WARN ActionStartXCommand:523 [pool-12-thread-224] -
SERVER[wf322] USER[saley] GROUP[users] TOKEN[] APP[saleyapp]
JOB[123-123-oozie-saley--W] ACTION[123-123-oozie-saley--W@saleyapp] Suspending
Workflow Job id=123-123-oozie-saley--W
2018-04-23 02:15:25,622 DEBUG LiteWorkflowInstance:526 [pool-12-thread-224] -
SERVER[wf322] USER[saley] GROUP[users] TOKEN[] APP[saleyapp]
JOB[123-123-oozie-saley--W] ACTION[123-123-oozie-saley--W@saleyapp] Suspending
job
{code}
While starting the action, we get a non-transient exception.
https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/command/wf/ActionStartXCommand.java#L290-L305
{code}
ActionStartXCommand.java
catch (ActionExecutorException ex) {
    LOG.warn("Error starting action [{0}]. ErrorType [{1}], ErrorCode [{2}], Message [{3}]",
            wfAction.getName(), ex.getErrorType(), ex.getErrorCode(), ex.getMessage(), ex);
    wfAction.setErrorInfo(ex.getErrorCode(), ex.getMessage());
    switch (ex.getErrorType()) {
        case TRANSIENT:
            if (!handleTransient(context, executor, WorkflowAction.Status.START_RETRY)) {
                handleNonTransient(context, executor, WorkflowAction.Status.START_MANUAL);
                wfAction.setPendingAge(new Date());
                wfAction.setRetries(0);
                wfAction.setStartTime(null);
            }
            break;
        case NON_TRANSIENT:
            handleNonTransient(context, executor, WorkflowAction.Status.START_MANUAL);
{code}
We put the workflow action in START_MANUAL and suspend the workflow.
https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/command/wf/ActionXCommand.java#L125-L144
{code}
ActionXCommand.java
protected void handleNonTransient(ActionExecutor.Context context,
        ActionExecutor executor, WorkflowAction.Status status) throws CommandException {
    ActionExecutorContext aContext = (ActionExecutorContext) context;
    WorkflowActionBean action = (WorkflowActionBean) aContext.getAction();
    incrActionErrorCounter(action.getType(), "nontransient", 1);
    WorkflowJobBean workflow = (WorkflowJobBean) context.getWorkflow();
    String id = workflow.getId();
    action.setStatus(status);
    action.resetPendingOnly();
    LOG.warn("Suspending Workflow Job id=" + id);
    try {
        SuspendXCommand.suspendJob(Services.get().get(JPAService.class), workflow, id,
                action.getId(), null);
    }
    catch (Exception e) {
        throw new CommandException(ErrorCode.E0727, id, e.getMessage());
    }
    finally {
        updateParentIfNecessary(workflow, 3);
    }
}
{code}
When updating the parent's status, we only handle a coordinator-action parent
(a parent ID containing "-C@"). We don't consider the case where the workflow's
parent is another workflow.
https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/command/wf/WorkflowXCommand.java#L92-L97
{code}
WorkflowXCommand.java
protected void updateParentIfNecessary(WorkflowJobBean wfjob, int maxRetries)
        throws CommandException {
    // update coordinator action if the wf was actually started by a coord
    if (wfjob.getParentId() != null && wfjob.getParentId().contains("-C@")) {
        new CoordActionUpdateXCommand(wfjob, maxRetries).call();
    }
}
{code}
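To make the gap concrete, here is a minimal sketch of how the parent check
could tell a coordinator-action parent from a workflow parent. It is not Oozie
code and not necessarily how the fix will look: the class ParentIdSketch, the
enum ParentKind, and the method classify are hypothetical names. The ID shapes
are assumptions based on the log above ("...-W" for a workflow job, "...-W@name"
for a workflow action) and on the "-C@" check in updateParentIfNecessary;
whether a subworkflow's parent ID is the parent job ID or the parent action ID
is not verified here, so the sketch accepts both.

```java
// Hypothetical helper, NOT part of Oozie: classifies a workflow's parent ID.
final class ParentIdSketch {

    enum ParentKind { NONE, COORD_ACTION, WORKFLOW, UNKNOWN }

    // Assumed ID shapes (taken from the log lines and the existing "-C@" check):
    //   "...-C@<n>"     coordinator action
    //   "...-W"         workflow job
    //   "...-W@<name>"  workflow action
    static ParentKind classify(String parentId) {
        if (parentId == null || parentId.isEmpty()) {
            return ParentKind.NONE;            // top-level workflow, nothing to update
        }
        if (parentId.contains("-C@")) {
            return ParentKind.COORD_ACTION;    // the only case handled today
        }
        if (parentId.endsWith("-W") || parentId.contains("-W@")) {
            return ParentKind.WORKFLOW;        // the case the current check skips
        }
        return ParentKind.UNKNOWN;
    }

    public static void main(String[] args) {
        // The workflow ID is the one from the log; the coordinator ID is made up.
        System.out.println(classify("123-123-oozie-saley--W"));
        System.out.println(classify("123-123-oozie-saley--C@1"));
        System.out.println(classify(null));
    }
}
```

With a check like this, updateParentIfNecessary would have a WORKFLOW branch in
which the parent workflow's status could be updated, in addition to the existing
coordinator branch.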
> Update workflow status and subworkflow status on suspend command
> ----------------------------------------------------------------
>
> Key: OOZIE-3366
> URL: https://issues.apache.org/jira/browse/OOZIE-3366
> Project: Oozie
> Issue Type: Bug
> Reporter: Satish Subhashrao Saley
> Assignee: Satish Subhashrao Saley
> Priority: Major
>
> Currently, when a subworkflow gets suspended, the status of its parent
> workflow is not updated correctly. Also, when a coord is suspended, the
> subworkflows are not suspended. We need to fix this.
>
>