[ 
https://issues.apache.org/jira/browse/OOZIE-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459729#comment-13459729
 ] 

Virag Kothari commented on OOZIE-994:
-------------------------------------

ActionStartX also checks the status of the Hadoop job (via the check() method) 
and in some cases may fail while the action is RUNNING.

If the action's status is kept as RUNNING and the job's as SUSPENDED, then I 
believe we won't have a way to distinguish between a user issuing a 'suspend' 
command and the workflow being suspended due to a transient error.
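
To make the ambiguity concrete, here is a minimal sketch, not Oozie code. The 
enums and the WAITING_ON_TRANSIENT_ERROR status are hypothetical names made up 
for illustration. It assumes a scheme in which a transient error moves the 
action to its own status, so the (job, action) pair identifies the cause.

```java
// Illustrative only: these enums are hypothetical and are not Oozie's real status types.
final class SuspendReasonSketch {
    enum JobStatus { RUNNING, SUSPENDED }

    // WAITING_ON_TRANSIENT_ERROR is an assumed, made-up status name.
    enum ActionStatus { RUNNING, WAITING_ON_TRANSIENT_ERROR }

    enum Cause { NOT_SUSPENDED, USER_SUSPEND, TRANSIENT_ERROR }

    // Classifies why a job is suspended, assuming transient errors give the
    // action a distinct status. If transient errors left the action as RUNNING,
    // both causes would yield (SUSPENDED, RUNNING) and could not be separated.
    static Cause classify(JobStatus job, ActionStatus action) {
        if (job != JobStatus.SUSPENDED) {
            return Cause.NOT_SUSPENDED;
        }
        return action == ActionStatus.WAITING_ON_TRANSIENT_ERROR
                ? Cause.TRANSIENT_ERROR
                : Cause.USER_SUSPEND;
    }
}
```

With only RUNNING available for the action, classify() would have a single 
input pair for both cases, which is the problem described above.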
                
> ActionCheckXCommand does not handle failures properly
> -----------------------------------------------------
>
>                 Key: OOZIE-994
>                 URL: https://issues.apache.org/jira/browse/OOZIE-994
>             Project: Oozie
>          Issue Type: Bug
>          Components: workflow
>    Affects Versions: 3.2.0
>            Reporter: Alejandro Abdelnur
>            Assignee: Robert Kanter
>            Priority: Critical
>             Fix For: trunk
>
>         Attachments: OOZIE-994.patch
>
>
> If the JT restarts or dies and running jobs are lost or the JT is not 
> reachable, Oozie ActionCheckXCommand will never fail the workflow job.
> There seem to be 2 issues here:
> * convertException no longer receives the root cause exception, but 
> always a HadoopAccessorException wrapping it. We should modify 
> convertException to inspect the cause exception as well.
> * ActionCheckXCommand does not implement the retry-handling logic of 
> ActionStartXCommand.
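
The first point in the quoted description, inspecting the cause behind a 
wrapper exception, might look roughly like the sketch below. It is not the 
attached patch. WrapperException is a hypothetical stand-in for 
HadoopAccessorException, and treating java.net.ConnectException as the 
transient case is an assumption chosen only for the example.

```java
import java.net.ConnectException;

// Illustrative only: not Oozie's convertException and not the OOZIE-994 patch.
final class CauseInspectionSketch {
    // Hypothetical stand-in for a wrapper such as HadoopAccessorException.
    static class WrapperException extends Exception {
        WrapperException(Throwable cause) {
            super(cause);
        }
    }

    // Walks the cause chain and returns the innermost throwable.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    // Assumption: a connection failure at the root is treated as transient.
    static boolean looksTransient(Throwable t) {
        return rootCause(t) instanceof ConnectException;
    }
}
```

Checking only the outermost exception type would always see the wrapper, 
which is why the sketch classifies on the root cause instead.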
