Levi Ramsey created FLINK-19773:
-----------------------------------
Summary: Exponential backoff restart strategy
Key: FLINK-19773
URL: https://issues.apache.org/jira/browse/FLINK-19773
Project: Flink
Issue Type: Improvement
Affects Versions: 1.11.2
Reporter: Levi Ramsey
There are situations where the current restart strategies (fixed-delay and
failure-rate) seem to be suboptimal. For example, with HDFS sinks, a delay
between restarts that is shorter than the HDFS lease expiration time results
in many restart attempts that fail, putting pointless stress on the cluster.
On the other hand, setting the delay close to the lease expiration time means
far more downtime than necessary when the cause of failure resolves itself
quickly.
An exponential backoff restart strategy would address this. For example, when
jobs are contending for a lease on a shared resource, and that lease expires
after 1200 seconds of inactivity, the strategy might use successive delays of
1, 2, 4, 8, 16, ..., 1024 seconds (by which point a cumulative delay of more
than 1200 seconds has passed).
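To make the arithmetic concrete, here is a minimal Python sketch of that delay schedule. The function name `backoff_delays` and its parameters (`initial`, `multiplier`, `max_delay`) are hypothetical names chosen for illustration; they are not an existing Flink API.

```python
from itertools import accumulate


def backoff_delays(initial=1, multiplier=2, max_delay=1024):
    """Return successive restart delays in seconds, growing geometrically.

    All parameter names are illustrative assumptions, not Flink options.
    """
    delays = []
    delay = initial
    while delay <= max_delay:
        delays.append(delay)
        delay *= multiplier
    return delays


delays = backoff_delays()
# Running total of time spent waiting after each restart attempt.
cumulative = list(accumulate(delays))

for d, total in zip(delays, cumulative):
    print(f"delay={d:5d}s  cumulative={total:5d}s")
```

With these defaults the running total is 1023 seconds after the 512-second delay and 2047 seconds after the 1024-second delay, so the 1200-second lease is only outlasted once the final delay has elapsed.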
While not intrinsically tied to exponential backoff (it is more an example of
variable delay), when many jobs fail at once due to an infrastructure failure,
a thundering herd scenario can be mitigated by adding jitter to the delays,
e.g. 0 -> 1 -> 2 -> 3/4/5 -> 5/6/7/8/9/10/11 seconds. With this jitter, a set
of jobs competing to restart will eventually spread out.
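One possible way to produce such jittered delays is sketched below in Python. The spread rule (`base // 2 - 1`, floored at zero) is an assumption picked only because it happens to reproduce the 3/4/5 and 5..11 ranges in the example above; a real implementation could equally use a proportional or fully random jitter. The name `jittered_delay` is hypothetical.

```python
import random


def jittered_delay(base, rng):
    """Return base plus a random integer offset in [-spread, +spread] seconds.

    The spread rule is an illustrative assumption: 1 -> 0, 2 -> 0, 4 -> 1, 8 -> 3.
    """
    spread = max(0, base // 2 - 1)
    return base + rng.randint(-spread, spread)


rng = random.Random(42)  # seeded only so the sketch is repeatable

# Observe which delays each base value can produce across many jobs.
for base in (1, 2, 4, 8):
    seen = sorted({jittered_delay(base, rng) for _ in range(1000)})
    print(f"base={base}s -> observed delays {seen}")
```

Because each job draws its own offset, jobs that failed at the same instant end up restarting at different times, and the spread widens as the base delay grows.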
(Logging the ticket mainly to start a discussion and perhaps get context on
whether this has been considered and rejected, etc.)