[ 
https://issues.apache.org/jira/browse/GOBBLIN-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991831#comment-16991831
 ] 

Chen Guo edited comment on GOBBLIN-998 at 12/9/19 6:19 PM:
-----------------------------------------------------------

Will fix this bug by introducing a new status called PENDING_RETRY, which 
is distinct from the initial PENDING status. 
 * PENDING is the status between COMPILED and ORCHESTRATED
 ** A job enters the PENDING state after the flow has been compiled and 
before it is orchestrated to a SpecExecutor.
 * PENDING_RETRY also sits between COMPILED and ORCHESTRATED
 ** When a job fails and currentAttempt < maxAttempt, its status is reset 
from FAILED to PENDING_RETRY while it waits to be sent to the SpecExecutor 
again.
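
The transition described above can be sketched roughly as follows. This is only an illustration under assumptions: the class RetryStatusSketch, the method statusAfterEvent, and the local Status enum are hypothetical names, not Gobblin's actual ExecutionStatus or KafkaJobStatusMonitor code.

```java
// Hypothetical sketch; names are illustrative and not taken from Gobblin.
class RetryStatusSketch {
    // Assumed subset of statuses, including the proposed PENDING_RETRY.
    enum Status { COMPILED, PENDING, PENDING_RETRY, ORCHESTRATED, RUNNING, COMPLETE, FAILED }

    /**
     * Returns the status to store after a job reports the given status.
     * A FAILED job with attempts remaining is reset to PENDING_RETRY;
     * every other report is stored unchanged.
     */
    static Status statusAfterEvent(Status reported, int currentAttempt, int maxAttempt) {
        if (reported == Status.FAILED && currentAttempt < maxAttempt) {
            return Status.PENDING_RETRY;
        }
        return reported;
    }

    public static void main(String[] args) {
        System.out.println(statusAfterEvent(Status.FAILED, 1, 3));  // PENDING_RETRY
        System.out.println(statusAfterEvent(Status.FAILED, 3, 3));  // FAILED
        System.out.println(statusAfterEvent(Status.RUNNING, 1, 3)); // RUNNING
    }
}
```

Keeping PENDING_RETRY separate from PENDING lets a reader of the stored status tell a first submission apart from a resubmission.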

 


was (Author: enjoyear):
Will fix this bug by introducing a new status called PENDING_RETRY, which 
differentiates itself from the first PENDING status. 
 * PENDING is the status which exists between the COMPILED and ORCHESTRATED
 ** Job lands in the PENDING state before it's orchestrated to a SpecExecutor 
and after the flow has been compiled.
 * PENDING_RETRY is also the status existing between the COMPILED and 
ORCHESTRATED
 ** When the job fails and currentAttempt < maxAttempt, 

> ExecutionStatus should be reset to PENDING before a job retries
> ---------------------------------------------------------------
>
>                 Key: GOBBLIN-998
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-998
>             Project: Apache Gobblin
>          Issue Type: Bug
>            Reporter: Chen Guo
>            Priority: Critical
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In modifyStateIfRetryRequired of KafkaJobStatusMonitor, when the state is 
> FAILED and currentAttempts < maxAttempts, the ExecutionStatus is set to 
> RUNNING. 
> However, due to the check-in from 
> GOBBLIN-974 ([https://github.com/apache/incubator-gobblin/blob/9f50a2563cc257039da44018663b6b9e119fb499/gobblin-service/src/main/java/org/apache/gobblin/service/monitoring/KafkaJobStatusMonitor.java#L159]),
> the currentAttempts update from a lower-order event (such as ORCHESTRATED) 
> cannot be consumed to update the jobState file. This causes infinite 
> retries of failed jobs in DagManagerThread when it calls pollAndAdvanceDag.
>  
> The solution is to update ExecutionStatus to PENDING instead of RUNNING.
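
The ordering problem in the description can be illustrated with a small sketch. Everything here is an assumption made for illustration: the ORDER list, the class StatusOrderSketch, and the isApplied check are hypothetical stand-ins for whatever comparison the linked GOBBLIN-974 change performs, and the real status ordering in Gobblin may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the actual KafkaJobStatusMonitor logic.
class StatusOrderSketch {
    // Assumed ordering, lowest first.
    static final List<String> ORDER = Arrays.asList(
        "COMPILED", "PENDING", "ORCHESTRATED", "RUNNING", "COMPLETE", "FAILED");

    /** An incoming event is applied only if it does not move the status backwards. */
    static boolean isApplied(String stored, String incoming) {
        return ORDER.indexOf(incoming) >= ORDER.indexOf(stored);
    }

    public static void main(String[] args) {
        // Stored status was forced to RUNNING: the retry's ORCHESTRATED event is dropped,
        // so its currentAttempts value never reaches the stored job state.
        System.out.println(isApplied("RUNNING", "ORCHESTRATED")); // false
        // Stored status reset to PENDING: the ORCHESTRATED event is accepted.
        System.out.println(isApplied("PENDING", "ORCHESTRATED")); // true
    }
}
```

Under this assumed ordering, resetting to a status below ORCHESTRATED is what allows the attempt counter from the retry to be recorded.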



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
