[
https://issues.apache.org/jira/browse/FLINK-13452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896912#comment-16896912
]
Zhu Zhu commented on FLINK-13452:
---------------------------------
I think this happens because the currently implemented RestartStrategies do not
handle exceptions thrown by the restart callback. If that is by design, all
_*RestartCallback*_ implementations should catch all exceptions in
_*triggerFullRecovery*_ and handle them to recover the job (possibly via
failGlobal).
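To illustrate why such an exception goes unnoticed, here is a minimal, self-contained sketch. It assumes the restart callback is scheduled on a JDK-style scheduled executor; the class and method names are hypothetical and this is not Flink code. The exception ends up stored in the returned future, and nothing surfaces unless someone inspects that future.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Flink.
final class SwallowedExceptionDemo {

    // Schedules a failing "restart" and returns the message of the exception
    // captured by the future, or null if the task did not fail.
    static String scheduleFailingRestart() {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        try {
            Runnable restart = () -> {
                throw new RuntimeException("no checkpoint could be read");
            };
            // The executor catches the exception and stores it in the future.
            ScheduledFuture<?> future = executor.schedule(restart, 10, TimeUnit.MILLISECONDS);
            try {
                future.get();
                return null;
            } catch (ExecutionException e) {
                // Only a caller that checks the future ever sees the failure.
                return e.getCause().getMessage();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(scheduleFailingRestart());
    }
}
```

If the caller drops the future, as a fire-and-forget restart would, the failure is effectively lost.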
For this case, fixing
AdaptedRestartPipelinedRegionStrategyNG#createResetAndRescheduleTasksCallback
to catch all exceptions and invoke failGlobal to recover the job would be a
proper choice.
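A rough sketch of what I mean, under stated assumptions: `RestartCallback` and `triggerFullRecovery` follow the names used above, while `GlobalFailureHandler`, `SafeRestartCallback`, and the demo wiring are hypothetical stand-ins so that the example compiles on its own. The real Flink signatures may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not the actual Flink implementation.
final class SafeRestartCallbackDemo {

    // Stand-in for the callback type mentioned above.
    interface RestartCallback {
        void triggerFullRecovery();
    }

    // Stand-in for whatever exposes failGlobal on the job.
    interface GlobalFailureHandler {
        void failGlobal(Throwable cause);
    }

    // Routes any exception from the restart path to failGlobal instead of
    // letting it escape into the executor that runs the restart.
    static final class SafeRestartCallback implements RestartCallback {
        private final RestartCallback delegate;
        private final GlobalFailureHandler handler;

        SafeRestartCallback(RestartCallback delegate, GlobalFailureHandler handler) {
            this.delegate = delegate;
            this.handler = handler;
        }

        @Override
        public void triggerFullRecovery() {
            try {
                delegate.triggerFullRecovery();
            } catch (Throwable t) {
                handler.failGlobal(t);
            }
        }
    }

    // Runs a callback that always fails and returns what failGlobal received.
    static List<String> run() {
        List<String> reported = new ArrayList<>();
        RestartCallback failing = () -> {
            throw new RuntimeException("could not restore checkpoint");
        };
        new SafeRestartCallback(failing, cause -> reported.add(cause.getMessage()))
                .triggerFullRecovery();
        return reported;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```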
As for failGlobal, IMO, any exception thrown directly from it is unexpected.
You can explicitly use FatalExitExceptionHandler to handle this unexpected
case, which is aligned with what is done in
AdaptedRestartPipelinedRegionStrategyNG#restartTasks.
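A possible shape for that guard, as a sketch only: it assumes the fatal handler can be invoked through the JDK `Thread.UncaughtExceptionHandler` interface. `GlobalFailureHandler` and `failGlobalOrDie` are hypothetical names, and the demo handler below records the error instead of exiting the process so that the example can run.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class, not the actual Flink implementation.
final class GuardedFailGlobalDemo {

    // Stand-in for whatever exposes failGlobal on the job.
    interface GlobalFailureHandler {
        void failGlobal(Throwable cause);
    }

    // Calls failGlobal and hands anything it throws to the fatal handler.
    static void failGlobalOrDie(
            GlobalFailureHandler handler,
            Throwable cause,
            Thread.UncaughtExceptionHandler fatalHandler) {
        try {
            handler.failGlobal(cause);
        } catch (Throwable t) {
            fatalHandler.uncaughtException(Thread.currentThread(), t);
        }
    }

    // Simulates a failGlobal that itself throws and returns what the fatal
    // handler observed.
    static String run() {
        AtomicReference<String> fatal = new AtomicReference<>();
        GlobalFailureHandler broken = cause -> {
            throw new IllegalStateException("failGlobal failed");
        };
        failGlobalOrDie(
                broken,
                new RuntimeException("restart failed"),
                (thread, error) -> fatal.set(error.getMessage()));
        return fatal.get();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```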
> Pipelined region failover strategy does not recover Job if checkpoint cannot
> be read
> ------------------------------------------------------------------------------------
>
> Key: FLINK-13452
> URL: https://issues.apache.org/jira/browse/FLINK-13452
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.9.0, 1.10.0
> Reporter: Gary Yao
> Assignee: Yun Tang
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: jobmanager.log
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The job does not recover if a checkpoint cannot be read and
> {{jobmanager.execution.failover-strategy}} is set to _"region"_.
> *Analysis*
> The {{RestartCallback}} created by
> {{AdaptedRestartPipelinedRegionStrategyNG}} throws a {{RuntimeException}} if
> no checkpoints could be read. When the restart is invoked in a separate
> thread pool, the exception is swallowed. See:
> [https://github.com/apache/flink/blob/21621fbcde534969b748f21e9f8983e3f4e0fb1d/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/AdaptedRestartPipelinedRegionStrategyNG.java#L117-L119]
> [https://github.com/apache/flink/blob/21621fbcde534969b748f21e9f8983e3f4e0fb1d/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/restart/FixedDelayRestartStrategy.java#L65]
> *Expected behavior*
> * Job should restart
>
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)