[ 
https://issues.apache.org/jira/browse/FLINK-13452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896912#comment-16896912
 ] 

Zhu Zhu edited comment on FLINK-13452 at 7/31/19 8:27 AM:
----------------------------------------------------------

I think this happens because the currently implemented RestartStrategies do not 
handle exceptions from the restart callback. If that is by design, all 
_RestartCallback_ implementations should catch all exceptions in 
_triggerFullRecovery_ and handle them to recover the job (perhaps via failGlobal).

For this case, fixing 
_AdaptedRestartPipelinedRegionStrategyNG#createResetAndRescheduleTasksCallback_ 
to catch all exceptions and invoke failGlobal to recover the job would be a 
proper choice.
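
A minimal sketch of that idea, not the actual Flink code: the _RestartCallback_ and _GlobalFailureHandler_ interfaces below are hypothetical stand-ins for the Flink types, and _guarded_ is a helper name made up for illustration. It only shows the shape of "catch everything in the callback and escalate to failGlobal".

```java
import java.util.ArrayList;
import java.util.List;

public class RestartCallbackSketch {

    /** Hypothetical stand-in for Flink's RestartCallback. */
    interface RestartCallback {
        void triggerFullRecovery();
    }

    /** Hypothetical stand-in for whatever exposes failGlobal. */
    interface GlobalFailureHandler {
        void failGlobal(Throwable cause);
    }

    /** Wraps a restart action so that any Throwable is escalated to failGlobal. */
    static RestartCallback guarded(Runnable restartAction, GlobalFailureHandler handler) {
        return () -> {
            try {
                restartAction.run();
            } catch (Throwable t) {
                // Nothing escapes the callback; the failure becomes a global failover.
                handler.failGlobal(t);
            }
        };
    }

    public static void main(String[] args) {
        List<Throwable> globalFailures = new ArrayList<>();
        RestartCallback callback = guarded(
            () -> { throw new RuntimeException("checkpoint cannot be read"); },
            globalFailures::add);

        callback.triggerFullRecovery();
        System.out.println("global failures recorded: " + globalFailures.size());
    }
}
```

With this shape the restart strategy never sees an exception from the callback, so it does not matter whether the strategy itself handles them.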

As for _failGlobal_, IMO, any exception thrown directly from it is unexpected. 
You can explicitly use _FatalExitExceptionHandler_ to handle this unexpected 
case, which is consistent with what is done in 
_AdaptedRestartPipelinedRegionStrategyNG#restartTasks_. 
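
Roughly, the fallback could look like the sketch below. It is an illustration under assumptions: _GlobalFailureHandler_ and _failGlobalOrEscalate_ are hypothetical names, and a plain _Thread.UncaughtExceptionHandler_ stands in for _FatalExitExceptionHandler_ so the example does not terminate the JVM.

```java
import java.util.concurrent.atomic.AtomicReference;

public class FatalFallbackSketch {

    /** Hypothetical stand-in for whatever exposes failGlobal. */
    interface GlobalFailureHandler {
        void failGlobal(Throwable cause);
    }

    /**
     * Calls failGlobal; if failGlobal itself throws, hands the new error to a
     * fatal handler instead of letting it propagate or disappear.
     */
    static void failGlobalOrEscalate(
            GlobalFailureHandler handler,
            Throwable cause,
            Thread.UncaughtExceptionHandler fatalHandler) {
        try {
            handler.failGlobal(cause);
        } catch (Throwable t) {
            // Unexpected: the last-resort recovery path failed as well.
            fatalHandler.uncaughtException(Thread.currentThread(), t);
        }
    }

    public static void main(String[] args) {
        AtomicReference<Throwable> fatal = new AtomicReference<>();

        // A failGlobal that is itself broken, to exercise the fallback.
        GlobalFailureHandler broken = cause -> {
            throw new IllegalStateException("failGlobal failed");
        };

        failGlobalOrEscalate(
            broken,
            new RuntimeException("checkpoint cannot be read"),
            (thread, error) -> fatal.set(error));

        System.out.println("fatal handler received: " + fatal.get().getMessage());
    }
}
```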



> Pipelined region failover strategy does not recover Job if checkpoint cannot 
> be read
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-13452
>                 URL: https://issues.apache.org/jira/browse/FLINK-13452
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Gary Yao
>            Assignee: Yun Tang
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.9.0
>
>         Attachments: jobmanager.log
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The job does not recover if a checkpoint cannot be read and 
> {{jobmanager.execution.failover-strategy}} is set to _"region"_. 
> *Analysis*
> The {{RestartCallback}} created by 
> {{AdaptedRestartPipelinedRegionStrategyNG}} throws a {{RuntimeException}} if 
> no checkpoints could be read. When the restart is invoked in a separate 
> thread pool, the exception is swallowed. See:
> [https://github.com/apache/flink/blob/21621fbcde534969b748f21e9f8983e3f4e0fb1d/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/AdaptedRestartPipelinedRegionStrategyNG.java#L117-L119]
> [https://github.com/apache/flink/blob/21621fbcde534969b748f21e9f8983e3f4e0fb1d/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/restart/FixedDelayRestartStrategy.java#L65]
> *Expected behavior*
>  * Job should restart
>  
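
To illustrate how an exception can be swallowed when work is handed to a thread pool, here is a small standalone sketch. It uses a plain _ScheduledExecutorService_ as an assumed stand-in for the executor used by the restart strategy, and _runAndInspect_ is a made-up helper; the exact mechanism in Flink may differ. The point is that the exception is stored in the returned future and is only visible if someone inspects that future.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class SwallowedExceptionSketch {

    /** Schedules a failing task and returns the failure as seen through the future. */
    static Throwable runAndInspect() {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        try {
            // Typed as Runnable on purpose so the schedule(Runnable, ...) overload is used.
            Runnable restart = () -> {
                throw new RuntimeException("no checkpoint could be read");
            };
            ScheduledFuture<?> future = executor.schedule(restart, 10, TimeUnit.MILLISECONDS);

            try {
                // Without this call nobody ever observes the exception.
                future.get();
                return null;
            } catch (ExecutionException e) {
                return e.getCause();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return e;
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Throwable failure = runAndInspect();
        System.out.println("failure seen only via the future: " + failure);
    }
}
```

If the caller drops the future, the task fails silently, which matches the symptom described in the issue.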



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
