[
https://issues.apache.org/jira/browse/FLINK-21846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303997#comment-17303997
]
Till Rohrmann commented on FLINK-21846:
---------------------------------------
A wrongly specified savepoint path cannot occur during rescaling. However,
rescaling and the initial start of a job are currently treated identically
(wait for resources, then start the job => same code path).
What can go wrong during a rescaling event is any I/O operation (for example,
access to the ZooKeeper/K8s HA store, or running MasterHooks that interact with
other systems).
For the first run, when we need to recover the checkpoints, we also need to
read the checkpoint metadata file from disk.
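
To make the distinction concrete, here is a minimal Java sketch of retrying a
creation step on I/O failures while letting every other failure propagate
immediately. It is not Flink code: RetryPolicy, PermanentFailureException,
fixedDelay and createWithRetry are hypothetical names made up for illustration,
and treating IOException as "transient" is an assumption, not how the
AdaptiveScheduler classifies failures.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class GraphCreationRetrySketch {

    /** Hypothetical stand-in for a restart policy; not a Flink interface. */
    interface RetryPolicy {
        boolean canRetry(int failedAttempts);
        long backoffMillis(int failedAttempts);
    }

    /** Hypothetical marker for failures that a retry cannot fix. */
    static class PermanentFailureException extends Exception {
        PermanentFailureException(String message) {
            super(message);
        }
    }

    /** Allows up to maxRetries retries, each after the same fixed delay. */
    static RetryPolicy fixedDelay(int maxRetries, long delayMillis) {
        return new RetryPolicy() {
            @Override
            public boolean canRetry(int failedAttempts) {
                return failedAttempts <= maxRetries;
            }

            @Override
            public long backoffMillis(int failedAttempts) {
                return delayMillis;
            }
        };
    }

    /**
     * Runs the creation step. An IOException is assumed to be transient and is
     * retried for as long as the policy allows; any other exception (including
     * PermanentFailureException) is not caught and fails the call at once.
     */
    static <T> T createWithRetry(Callable<T> creation, RetryPolicy policy)
            throws Exception {
        int failedAttempts = 0;
        while (true) {
            try {
                return creation.call();
            } catch (IOException e) {
                failedAttempts++;
                if (!policy.canRetry(failedAttempts)) {
                    throw e; // retries exhausted: surface the last I/O error
                }
                Thread.sleep(policy.backoffMillis(failedAttempts));
            }
        }
    }
}
```

With fixedDelay(3, 100), a creation step that keeps throwing IOException is
attempted four times in total before the error surfaces, whereas a step that
throws PermanentFailureException fails on the first attempt.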
> Rethink whether failure of ExecutionGraph creation should directly fail the
> job
> -------------------------------------------------------------------------------
>
> Key: FLINK-21846
> URL: https://issues.apache.org/jira/browse/FLINK-21846
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Coordination
> Affects Versions: 1.13.0
> Reporter: Till Rohrmann
> Priority: Major
> Fix For: 1.13.0
>
>
> Currently, the {{AdaptiveScheduler}} fails a job execution if the
> {{ExecutionGraph}} creation fails. This can be problematic because the
> failure could result from a transient problem (e.g. filesystem is currently
> not available). In the case of a transient problem a job rescaling could lead
> to a job failure which might be a bit surprising for users. Instead, I would
> expect that Flink would retry the {{ExecutionGraph}} creation.
> One idea could be to ask the restart policy how to treat the failure and
> whether to retry the {{ExecutionGraph}} creation or not.
> One thing to keep in mind, though, is that some failures might be permanent
> (e.g. a wrongly specified savepoint path). In such a case we would
> ideally fail immediately. One way to address this problem could be to try to
> restore the savepoint once we create the {{AdaptiveScheduler}}.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)