[ 
https://issues.apache.org/jira/browse/FLINK-35288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844106#comment-17844106
 ] 

Biao Geng commented on FLINK-35288:
-----------------------------------

https://cwiki.apache.org/confluence/display/FLINK/FLIP-364%3A+Improve+the+exponential-delay+restart-strategy#FLIP364:Improvetheexponentialdelayrestartstrategy-1.2Differentsemanticsofrestartattemptscauseregionfailovernotasexpected
The FLIP above discusses the 'restart-strategy.fixed-delay.attempts' problem. 
When region failover (the default for 
*jobmanager.execution.failover-strategy*) is enabled, the 
org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler#handleFailure
 method is called each time a subtask in a region fails, and each call 
consumes one of the job-level 'restart-strategy.fixed-delay.attempts'. As a 
result, the restart strategy may not behave as the documentation describes.
We have also encountered this case in our production environment.
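
To illustrate the effect, here is a minimal sketch in Python. It is a 
simplified model, not Flink's actual code: the names FixedDelayCounter and 
simulate_outage are hypothetical, and it only assumes that every task failure 
is charged against one job-level attempt counter.

```python
from dataclasses import dataclass


@dataclass
class FixedDelayCounter:
    """Hypothetical stand-in for a job-level fixed-delay attempt counter."""

    max_attempts: int
    used: int = 0

    def on_task_failure(self) -> bool:
        """Record one task failure; return True if a restart is still allowed."""
        self.used += 1
        return self.used <= self.max_attempts


def simulate_outage(max_attempts: int, failing_subtasks: int) -> str:
    """Charge each failing subtask to the shared counter (assumed behaviour)."""
    counter = FixedDelayCounter(max_attempts)
    for _ in range(failing_subtasks):
        if not counter.on_task_failure():
            # The budget ran out during a single outage, so the job fails.
            return "FAILED"
    return "RESTARTING"


if __name__ == "__main__":
    # One sink outage hitting 4 independent subtasks with attempts = 3.
    print(simulate_outage(max_attempts=3, failing_subtasks=4))
    # A single failing subtask with the same setting.
    print(simulate_outage(max_attempts=3, failing_subtasks=1))
```

In this model, one outage that hits more subtasks than the configured number 
of attempts uses up the whole budget before any restart has had a chance to 
succeed.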

> Flink Restart Strategy does not work as documented
> --------------------------------------------------
>
>                 Key: FLINK-35288
>                 URL: https://issues.apache.org/jira/browse/FLINK-35288
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Keshav Kansal
>            Priority: Minor
>
> As per the documentation, when using the Fixed Delay Restart Strategy,
> *restart-strategy.fixed-delay.attempts* defines "The number of times that 
> Flink retries the execution before the job is declared as failed if has been 
> set to fixed-delay". 
> However, in reality it is the *maximum-total-task-failures*, i.e. it is 
> possible that the job does not even attempt to restart. 
> This is as documented in 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1%3A+Fine+Grained+Recovery+from+Task+Failures
> If there is an outage at the sink level, for example an Elasticsearch 
> outage, all the independent tasks might fail and the job will fail 
> immediately without a restart (if restart-strategy.fixed-delay.attempts is 
> set lower than or equal to the parallelism of the sink).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
