[ https://issues.apache.org/jira/browse/OOZIE-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Purshotam Shah updated OOZIE-2476:
----------------------------------
Description:
Noticed multiple times in our production.
If one of the actions in a fork fails with a transient error (but succeeds after
a few retries), the workflow never joins.
This happens when one of the actions in the fork fails to submit its job.
Oozie requeues the command as queue(this, retryDelayMillis) on a transient error.
ActionStartXCommand doesn't reload the job if it is not null.
Before ActionStartXCommand runs again, other actions have already started and
modified the job info. ActionStartXCommand still holds the old info, which it
writes to the DB, so we lose some workflow instance data.
was:
Noticed multiple times in our production.
If one of the actions in a fork fails with a transient error (but succeeds after
a few retries), the workflow never joins.
> When one of the action from fork fails with transient error, WF never joins.
> ----------------------------------------------------------------------------
>
> Key: OOZIE-2476
> URL: https://issues.apache.org/jira/browse/OOZIE-2476
> Project: Oozie
> Issue Type: Bug
> Reporter: Purshotam Shah
> Assignee: Purshotam Shah
> Attachments: OOZIE-2476-V1.patch
>
>
> Noticed multiple times in our production.
> If one of the actions in a fork fails with a transient error (but succeeds
> after a few retries), the workflow never joins.
> This happens when one of the actions in the fork fails to submit its job.
> Oozie requeues the command as queue(this, retryDelayMillis) on a transient error.
> ActionStartXCommand doesn't reload the job if it is not null.
> Before ActionStartXCommand runs again, other actions have already started
> and modified the job info. ActionStartXCommand still holds the old info,
> which it writes to the DB, so we lose some workflow instance data.
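
The lost update described above can be reproduced in isolation. The following is a minimal sketch only, not Oozie code: Store, StartCommand, loadState and the reloadOnRetry flag are hypothetical stand-ins, and reloading on every attempt is just one possible remedy, not necessarily what the attached patch does. It assumes the command caches the job on its first attempt and writes the whole cached copy back on retry.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of the stale-job problem; none of these types are Oozie classes.
class StaleJobRetrySketch {

    // Stand-in for the workflow job row in the DB (instance data as key/value pairs).
    static final class Store {
        private Map<String, String> row = new TreeMap<>();

        Map<String, String> load() {
            return new TreeMap<>(row);          // each caller gets its own copy
        }

        void save(Map<String, String> snapshot) {
            row = new TreeMap<>(snapshot);      // whole-row overwrite
        }
    }

    // Stand-in for a start command that keeps the job it loaded on the first attempt.
    static final class StartCommand {
        private final Store store;
        private final String action;
        private final boolean reloadOnRetry;
        private Map<String, String> job;        // cached across retries

        StartCommand(Store store, String action, boolean reloadOnRetry) {
            this.store = store;
            this.action = action;
            this.reloadOnRetry = reloadOnRetry;
        }

        void loadState() {
            // The behaviour from the report: skip the load when the job is not null.
            if (job == null || reloadOnRetry) {
                job = store.load();
            }
        }

        void execute() {
            loadState();
            job.put(action, "started");
            store.save(job);
        }
    }

    static Map<String, String> run(boolean reloadOnRetry) {
        Store store = new Store();
        StartCommand a = new StartCommand(store, "actionA", reloadOnRetry);
        StartCommand b = new StartCommand(store, "actionB", reloadOnRetry);

        a.loadState();   // first attempt of A: job loaded, submission then fails transiently
        b.execute();     // sibling action in the fork starts and updates the job
        a.execute();     // retry of A after the delay
        return store.load();
    }

    public static void main(String[] args) {
        System.out.println("cached job:   " + run(false));
        System.out.println("reloaded job: " + run(true));
    }
}
```

With the cached job, the retry overwrites the row and the entry written by actionB disappears. With a reload on each attempt, both entries survive.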