Levi Ramsey created FLINK-19773:
-----------------------------------

             Summary: Exponential backoff restart strategy
                 Key: FLINK-19773
                 URL: https://issues.apache.org/jira/browse/FLINK-19773
             Project: Flink
          Issue Type: Improvement
    Affects Versions: 1.11.2
            Reporter: Levi Ramsey


There are situations where the current restart strategies (fixed-delay and
failure-rate) seem to be suboptimal.  For example, with HDFS sinks, a delay
between restarts that is shorter than the HDFS lease expiration time results
in many failed restart attempts, putting needless stress on the cluster.  On
the other hand, a delay close to the lease expiration time means far more
downtime than necessary when the cause of failure is something that resolves
itself quickly.

 

An exponential backoff restart strategy would address this.  For example, when
jobs are contending for a lease on a shared resource that expires after 1200
seconds of inactivity, a backoff strategy might use successive delays of
1, 2, 4, 8, 16, ..., 1024 seconds (by which point the cumulative delay is
2047 seconds, i.e. more than 1200 seconds have passed).
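
To make the arithmetic concrete, here is a minimal sketch of the delay sequence
described above.  The function name and its parameters (initial, multiplier,
max_delay) are hypothetical, chosen only for illustration; they are not existing
Flink configuration options.

```python
from itertools import accumulate


def backoff_delays(initial=1, multiplier=2, max_delay=1024):
    """Return successive restart delays in seconds, stopping at max_delay.

    All parameter names are illustrative assumptions, not Flink options.
    """
    delays = []
    delay = initial
    while delay <= max_delay:
        delays.append(delay)
        delay *= multiplier
    return delays


LEASE_TIMEOUT_SECONDS = 1200  # lease expiration from the example above

delays = backoff_delays()
cumulative = list(accumulate(delays))  # total time waited after each attempt

# Index of the first attempt whose cumulative wait exceeds the lease timeout.
first_past_lease = next(
    i for i, total in enumerate(cumulative) if total > LEASE_TIMEOUT_SECONDS
)

print(delays)
print(cumulative)
print(first_past_lease)
```

With these values the cumulative wait only passes 1200 seconds once the 1024
second delay has elapsed, while early, quickly resolved failures cost only a few
seconds of downtime.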

While not intrinsically tied to exponential backoff (it's more of an example of
variable delay), when many jobs fail at once due to an infrastructure failure,
a thundering herd scenario can be mitigated by adding jitter to the delays,
e.g. 0 -> 1 -> 2 -> 3/4/5 -> 5/6/7/8/9/10/11 seconds.  With this jitter, a set
of jobs competing to restart will eventually spread out.
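
One possible way to add such jitter is sketched below.  It assumes the jitter is
drawn uniformly from a band around the base delay; the jitter_factor parameter
and the function names are hypothetical, and the sketch does not try to
reproduce the exact numbers in the example above.

```python
import random


def jittered_delay(base_delay, jitter_factor=0.25, rng=random):
    """Return base_delay perturbed uniformly by up to +/- jitter_factor.

    jitter_factor is an illustrative assumption, not a Flink option.
    """
    spread = base_delay * jitter_factor
    return base_delay + rng.uniform(-spread, spread)


def restart_times(num_attempts, rng, initial=1.0, multiplier=2.0):
    """Return cumulative restart times (seconds) for one job."""
    times = []
    elapsed = 0.0
    delay = initial
    for _ in range(num_attempts):
        elapsed += jittered_delay(delay, rng=rng)
        times.append(elapsed)
        delay *= multiplier
    return times


# Three jobs that failed at the same instant, each with its own random source.
jobs = [restart_times(5, random.Random(seed)) for seed in (1, 2, 3)]
for times in jobs:
    print([round(t, 2) for t in times])
```

Because each job accumulates its own random offsets, restart attempts that began
in lockstep tend to drift apart as the delays grow.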

(Logging the ticket mainly to start a discussion and perhaps get context on
whether this has been considered and rejected, etc.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
