Chesnay Schepler created FLINK-20752:
----------------------------------------
Summary: FailureRateRestartBackoffTimeStrategy allows one less
restart than configured
Key: FLINK-20752
URL: https://issues.apache.org/jira/browse/FLINK-20752
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.10.0
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
Fix For: 1.13.0
The {{FailureRateRestartBackoffTimeStrategy}} maintains a list of failure
timestamps, keeping N timestamps where N is the configured of failures per
interval.
The timestamp is added when #notifyFailure() is called, and later evaluated
within #canRestart().
To determine whether a restart should be allowed we first check whether we are
already storing N timestamps, and if so check whether the earliest failure
still falls within the current interval. If it does, we reject the restart.
The problem is that we check whether we have already stored exactly N
timestamps. If we have exactly N timestamps, and we allow N failures per
interval, then we should be checking whether we have N+1 timestamps have been
stored instead.
For example, let's say we allow 2 exceptions, and 2 have occurred so far.
Regardless of what the timestamps are, we should still allow a restart in this
case.
Only once a third exception occurs should we be looking at the timestamps, and
we should furthermore only look at the exception exceeding the allowed failure
count; in this example it is the very first exception.
I don't know why this went unnoticed for so long; the relevant tests fail
rather reliably for me locally. ({{FailureRateRestartBackoffTimeStrategyTest}},
{{SimpleRecoveryFailureRateStrategyITBase}})
--
This message was sent by Atlassian Jira
(v8.3.4#803005)