[
https://issues.apache.org/jira/browse/FLINK-13452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896912#comment-16896912
]
Zhu Zhu edited comment on FLINK-13452 at 7/31/19 8:27 AM:
----------------------------------------------------------
I think this happens because the currently implemented RestartStrategies do not
handle exceptions from the restart callback. If that is by design, all
_RestartCallback_ implementations should catch all exceptions in
_triggerFullRecovery_ and handle them to recover the job (via failGlobal, maybe).
For this case, fixing
_AdaptedRestartPipelinedRegionStrategyNG#createResetAndRescheduleTasksCallback_
to catch all exceptions and invoke failGlobal to recover would be a proper
choice.
As for _failGlobal_, IMO, any exception directly thrown from it is unexpected.
You can explicitly use _FatalExitExceptionHandler_ to handle this unexpected
case, which is aligned with what is done in
_AdaptedRestartPipelinedRegionStrategyNG#restartTasks_.
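A minimal, self-contained sketch of the guarded-callback pattern suggested above. The _RestartCallback_ and _GlobalFailoverHandle_ types here are simplified stand-ins, not the actual Flink interfaces; the real fix would wrap the body of the callback returned by _createResetAndRescheduleTasksCallback_ and delegate to _ExecutionGraph#failGlobal_:
{code:java}
// Sketch only: RestartCallback and GlobalFailoverHandle are simplified stand-ins
// for the corresponding Flink types, used to illustrate catching everything inside
// triggerFullRecovery and falling back to a global failover.

@FunctionalInterface
interface RestartCallback {
    void triggerFullRecovery();
}

@FunctionalInterface
interface GlobalFailoverHandle {
    void failGlobal(Throwable cause);
}

final class GuardedRestartCallbackSketch {

    /**
     * Wraps the reset-and-reschedule logic so that any exception (e.g. an
     * unreadable checkpoint) triggers a global failover instead of escaping
     * into the restart strategy's executor and being swallowed.
     */
    static RestartCallback guarded(Runnable resetAndRescheduleTasks,
                                   GlobalFailoverHandle executionGraph) {
        return () -> {
            try {
                resetAndRescheduleTasks.run();
            } catch (Throwable t) {
                // Hand the failure back so the job goes through the normal
                // global failover path and can still recover.
                executionGraph.failGlobal(t);
            }
        };
    }

    public static void main(String[] args) {
        RestartCallback callback = guarded(
            () -> { throw new RuntimeException("checkpoint could not be read"); },
            cause -> System.out.println("failGlobal called with: " + cause));
        callback.triggerFullRecovery(); // prints the cause instead of throwing
    }
}
{code}
Catching _Throwable_ rather than only _Exception_ matters here, since any error raised while resetting and rescheduling tasks would otherwise still escape into the executor and vanish.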
> Pipelined region failover strategy does not recover Job if checkpoint cannot
> be read
> ------------------------------------------------------------------------------------
>
> Key: FLINK-13452
> URL: https://issues.apache.org/jira/browse/FLINK-13452
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.9.0, 1.10.0
> Reporter: Gary Yao
> Assignee: Yun Tang
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: jobmanager.log
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The job does not recover if a checkpoint cannot be read and
> {{jobmanager.execution.failover-strategy}} is set to _"region"_.
> *Analysis*
> The {{RestartCallback}} created by
> {{AdaptedRestartPipelinedRegionStrategyNG}} throws a {{RuntimeException}} if
> no checkpoints could be read. When the restart is invoked in a separate
> thread pool, the exception is swallowed. See:
> [https://github.com/apache/flink/blob/21621fbcde534969b748f21e9f8983e3f4e0fb1d/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/AdaptedRestartPipelinedRegionStrategyNG.java#L117-L119]
> [https://github.com/apache/flink/blob/21621fbcde534969b748f21e9f8983e3f4e0fb1d/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/restart/FixedDelayRestartStrategy.java#L65]
> *Expected behavior*
> * Job should restart
>
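The swallowing described in the *Analysis* section above can be reproduced outside of Flink with a plain {{ScheduledExecutorService}}: an exception thrown by a scheduled {{Runnable}} is captured in the returned future and never surfaces unless someone calls {{get()}} on it. The class below is a hypothetical, standalone demo, not Flink code:
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Standalone demo (not Flink code): an exception thrown from a Runnable handed to
// ScheduledExecutorService#schedule is stored in the returned ScheduledFuture.
// Nothing is printed and no uncaught-exception handler fires, which mirrors how
// the restart callback's RuntimeException disappears.
public class SwallowedExceptionDemo {

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();

        Runnable failingRestart = () -> {
            throw new RuntimeException("checkpoint could not be read");
        };

        // Comparable to the restart strategy scheduling triggerFullRecovery after
        // the configured delay; the returned future is never inspected.
        executor.schedule(failingRestart, 10, TimeUnit.MILLISECONDS);

        Thread.sleep(200);
        System.out.println("No stack trace above: the exception was silently swallowed.");
        executor.shutdown();
    }
}
{code}
Calling {{get()}} on the returned {{ScheduledFuture}}, or catching inside the callback as suggested in the comment above, is what makes the failure visible again.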