[ 
https://issues.apache.org/jira/browse/FLINK-21846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-21846:
----------------------------------
    Description: 
Currently, the {{AdaptiveScheduler}} fails a job execution if the 
{{ExecutionGraph}} creation fails. This can be problematic because the failure 
could result from a transient problem (e.g. the filesystem is temporarily 
unavailable). In the case of a transient problem, a job rescaling could lead to a 
job failure, which might be surprising for users. Instead, I would expect 
that Flink would retry the {{ExecutionGraph}} creation.

One idea could be to ask the restart policy how to treat the failure and 
whether to retry the {{ExecutionGraph}} creation or not.

One thing to keep in mind, though, is that some failures might be permanent 
(e.g. a wrongly specified savepoint path). In such a case we would 
ideally fail immediately. One way to address this problem could be to try to 
restore the savepoint once we create the {{AdaptiveScheduler}}.
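
The retry idea above could look roughly like the following sketch. It is only an illustration under stated assumptions: {{RetryPolicy}}, {{PermanentFailureException}}, {{fixedDelay}} and {{createWithRetry}} are hypothetical names and not existing Flink classes or methods. For simplicity, a permanent failure is modeled here as a dedicated exception type, rather than by restoring the savepoint eagerly when the {{AdaptiveScheduler}} is created.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.Callable;

public class GraphCreationRetrySketch {

    /** Hypothetical stand-in for a restart policy; not Flink's actual interface. */
    public interface RetryPolicy {
        boolean canRetry(int failedAttempts);

        Duration backoff(int failedAttempts);
    }

    /** Hypothetical marker for failures that retrying cannot fix (e.g. a wrong savepoint path). */
    public static class PermanentFailureException extends Exception {
        public PermanentFailureException(String message) {
            super(message);
        }
    }

    /** Simple policy: allow up to maxRetries retries with a constant delay between them. */
    public static RetryPolicy fixedDelay(int maxRetries, Duration delay) {
        return new RetryPolicy() {
            @Override
            public boolean canRetry(int failedAttempts) {
                return failedAttempts <= maxRetries;
            }

            @Override
            public Duration backoff(int failedAttempts) {
                return delay;
            }
        };
    }

    /** Runs the creation step, asking the policy after each failure whether to try again. */
    public static <T> T createWithRetry(Callable<T> creation, RetryPolicy policy) throws Exception {
        int failedAttempts = 0;
        while (true) {
            try {
                return creation.call();
            } catch (PermanentFailureException e) {
                // Retrying cannot help, so fail immediately without consulting the policy.
                throw e;
            } catch (Exception e) {
                failedAttempts++;
                if (!policy.canRetry(failedAttempts)) {
                    // The policy has given up; surface the last failure to the caller.
                    throw e;
                }
                Thread.sleep(policy.backoff(failedAttempts).toMillis());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated creation step: fails twice with a transient error, then succeeds.
        String graph = createWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("filesystem temporarily unavailable");
            }
            return "graph";
        }, fixedDelay(5, Duration.ofMillis(10)));
        System.out.println(graph + " created after " + calls[0] + " attempts");
    }
}
```

In this sketch only the policy decides how often a transient failure is retried, while a permanent failure bypasses the policy and fails the job right away.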

  was:
Currently, the {{AdaptiveScheduler}} fails a job execution if the 
{{ExecutionGraph}} creation fails. This can be problematic because the failure 
could result from a transient problem (e.g. filesystem is currently not 
available). In the case of a transient problem a job rescaling could lead to a 
job failure which might be a bit surprising for users. Instead, I would expect 
that Flink would retry the {{ExecutionGraph}} creation.

One idea could be to ask the restart policy for how to treat the failure and 
whether to retry the {{ExecutionGraph}} creation or not.

One thing to keep in mind, though, is that some failures might be permanent 
(e.g. a wrongly specified savepoint path). In such a case we would 
ideally fail immediately.


> Rethink whether failure of ExecutionGraph creation should directly fail the 
> job
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-21846
>                 URL: https://issues.apache.org/jira/browse/FLINK-21846
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.13.0
>            Reporter: Till Rohrmann
>            Priority: Major
>             Fix For: 1.13.0
>
>
> Currently, the {{AdaptiveScheduler}} fails a job execution if the 
> {{ExecutionGraph}} creation fails. This can be problematic because the 
> failure could result from a transient problem (e.g. the filesystem is 
> temporarily unavailable). In the case of a transient problem, a job rescaling 
> could lead to a job failure, which might be surprising for users. Instead, I would 
> expect that Flink would retry the {{ExecutionGraph}} creation.
> One idea could be to ask the restart policy how to treat the failure and 
> whether to retry the {{ExecutionGraph}} creation or not.
> One thing to keep in mind, though, is that some failures might be permanent 
> (e.g. a wrongly specified savepoint path). In such a case we would 
> ideally fail immediately. One way to address this problem could be to try to 
> restore the savepoint once we create the {{AdaptiveScheduler}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)