[jira] [Updated] (FLINK-35288) Flink Restart Strategy does not work as documented

Keshav Kansal (Jira) Fri, 03 May 2024 21:48:05 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-35288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Keshav Kansal updated FLINK-35288:
----------------------------------
    Description: 
As per the documentation, when using the Fixed Delay Restart Strategy,
*restart-strategy.fixed-delay.attempts* defines "The number of times that
Flink retries the execution before the job is declared as failed if has been
set to fixed-delay".

However, in reality it acts as *maximum-total-task-failures*, i.e. it is
possible that the job does not even attempt to restart.
This is as documented in
https://cwiki.apache.org/confluence/display/FLINK/FLIP-1%3A+Fine+Grained+Recovery+from+Task+Failures

If there is an outage at the sink level, for example an Elasticsearch outage,
all the independent tasks might fail and the job will immediately fail without
a restart (if restart-strategy.fixed-delay.attempts is set lower than or equal
to the parallelism of the sink).


  was:
As per the documentation, when using the Fixed Delay Restart Strategy,
*restart-strategy.fixed-delay.attempts* defines "The number of times that
Flink retries the execution before the job is declared as failed if has been
set to fixed-delay".

However, in reality it acts as *maximum-total-task-failures*, i.e. it is
possible that the job does not even attempt to restart.
This is as documented in
https://cwiki.apache.org/confluence/display/FLINK/FLIP-1%3A+Fine+Grained+Recovery+from+Task+Failures

If there is an outage at the sink level, for example an Elasticsearch outage,
all the independent tasks might fail and the job will immediately fail without
a restart if restart-strategy.fixed-delay.attempts is set lower than or equal
to the parallelism of the sink.



> Flink Restart Strategy does not work as documented
> --------------------------------------------------
>
>                 Key: FLINK-35288
>                 URL: https://issues.apache.org/jira/browse/FLINK-35288
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Keshav Kansal
>            Priority: Minor
>
> As per the documentation, when using the Fixed Delay Restart Strategy,
> *restart-strategy.fixed-delay.attempts* defines "The number of times that
> Flink retries the execution before the job is declared as failed if has been
> set to fixed-delay".
> However, in reality it acts as *maximum-total-task-failures*, i.e. it is
> possible that the job does not even attempt to restart.
> This is as documented in
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1%3A+Fine+Grained+Recovery+from+Task+Failures
> If there is an outage at the sink level, for example an Elasticsearch outage,
> all the independent tasks might fail and the job will immediately fail
> without a restart (if restart-strategy.fixed-delay.attempts is set lower than
> or equal to the parallelism of the sink).
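
The difference between the two readings of restart-strategy.fixed-delay.attempts
can be illustrated with a toy model. This is only a sketch, not Flink's actual
implementation: the class FixedDelayBudget, the function survives_outage, and
the per_task flag are hypothetical names, and the exact boundary condition
(whether the job fails when the failure count equals the limit or only when it
exceeds it) is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class FixedDelayBudget:
    """Toy model of a fixed-delay restart budget (hypothetical, not a Flink class)."""

    max_attempts: int
    used: int = 0

    def notify_failure(self) -> bool:
        """Record one failure and report whether a restart is still allowed."""
        self.used += 1
        # Assumption: a restart is allowed until the budget is exceeded.
        return self.used <= self.max_attempts


def survives_outage(max_attempts: int, failing_subtasks: int, per_task: bool) -> bool:
    """Return True if the job would still be restarted after one sink outage.

    per_task=True  models the behaviour described in the issue: every failed
                   subtask consumes one attempt.
    per_task=False models the reading suggested by the documentation: the
                   whole outage consumes a single attempt.
    """
    budget = FixedDelayBudget(max_attempts)
    failure_events = failing_subtasks if per_task else 1
    return all(budget.notify_failure() for _ in range(failure_events))


if __name__ == "__main__":
    # attempts = 3, sink parallelism = 4, all sink subtasks fail together
    print("counted per outage:", survives_outage(3, 4, per_task=False))
    print("counted per task:  ", survives_outage(3, 4, per_task=True))
```

Under these assumptions, with attempts set to 3 and four sink subtasks failing
in the same outage, per-task counting exhausts the budget within that single
outage, while per-outage counting leaves two attempts unused.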



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
