[jira] [Updated] (FLINK-20752) FailureRateRestartBackoffTimeStrategy allows one less restart than configured

Chesnay Schepler (Jira) Wed, 23 Dec 2020 03:52:06 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-20752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Chesnay Schepler updated FLINK-20752:
-------------------------------------
    Description: 
The {{FailureRateRestartBackoffTimeStrategy}} maintains a list of failure 
timestamps, keeping N timestamps where N is the configured of failures per 
interval.

The timestamp is added when #notifyFailure() is called, and later evaluated 
within #canRestart().

To determine whether a restart should be allowed we first check whether we are 
already storing N timestamps, and if so check whether the earliest failure 
still falls within the current interval. If it does, we reject the restart.

The problem is that we check whether we have already stored exactly N 
timestamps. If we have exactly N timestamps, and we allow N failures per 
interval, then we should not reject the restart by definition. We should 
instead be checking whether N+1 timestamps have been stored.

For example, let's say we allow 2 exceptions, and 2 have occurred so far. 
Regardless of what the timestamps are, we should still allow a restart in this 
case.
Only once a third exception occurs should we be looking at the timestamps, and 
we should furthermore only look at the exception exceeding the allowed failure 
count; in this example it is the very first exception.

I don't know why this went unnoticed for so long; the relevant tests fail 
rather reliably for me locally. ({{FailureRateRestartBackoffTimeStrategyTest}}, 
{{SimpleRecoveryFailureRateStrategyITBase}})

 

  was:
The {{FailureRateRestartBackoffTimeStrategy}} maintains a list of failure 
timestamps, keeping N timestamps where N is the configured of failures per 
interval.

The timestamp is added when #notifyFailure() is called, and later evaluated 
within #canRestart().

To determine whether a restart should be allowed we first check whether we are 
already storing N timestamps, and if so check whether the earliest failure 
still falls within the current interval. If it does, we reject the restart.

The problem is that we check whether we have already stored exactly N 
timestamps. If we have exactly N timestamps, and we allow N failures per 
interval, then we should be checking whether we have N+1 timestamps have been 
stored instead.

For example, let's say we allow 2 exceptions, and 2 have occurred so far. 
Regardless of what the timestamps are, we should still allow a restart in this 
case.
Only once a third exception occurs should we be looking at the timestamps, and 
we should furthermore only look at the exception exceeding the allowed failure 
count; in this example it is the very first exception.

I don't know why this went unnoticed for so long; the relevant tests fail 
rather reliably for me locally. ({{FailureRateRestartBackoffTimeStrategyTest}}, 
{{SimpleRecoveryFailureRateStrategyITBase}})

 


> FailureRateRestartBackoffTimeStrategy allows one less restart than configured
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-20752
>                 URL: https://issues.apache.org/jira/browse/FLINK-20752
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Chesnay Schepler
>            Assignee: Chesnay Schepler
>            Priority: Minor
>             Fix For: 1.13.0
>
>
> The {{FailureRateRestartBackoffTimeStrategy}} maintains a list of failure 
> timestamps, keeping N timestamps where N is the configured of failures per 
> interval.
> The timestamp is added when #notifyFailure() is called, and later evaluated 
> within #canRestart().
> To determine whether a restart should be allowed we first check whether we 
> are already storing N timestamps, and if so check whether the earliest 
> failure still falls within the current interval. If it does, we reject the 
> restart.
> The problem is that we check whether we have already stored exactly N 
> timestamps. If we have exactly N timestamps, and we allow N failures per 
> interval, then we should not reject the restart by definition. We should 
> instead be checking whether N+1 timestamps have been stored.
> For example, let's say we allow 2 exceptions, and 2 have occurred so far. 
> Regardless of what the timestamps are, we should still allow a restart in 
> this case.
> Only once a third exception occurs should we be looking at the timestamps, 
> and we should furthermore only look at the exception exceeding the allowed 
> failure count; in this example it is the very first exception.
> I don't know why this went unnoticed for so long; the relevant tests fail 
> rather reliably for me locally. 
> ({{FailureRateRestartBackoffTimeStrategyTest}}, 
> {{SimpleRecoveryFailureRateStrategyITBase}})
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (FLINK-20752) FailureRateRestartBackoffTimeStrategy allows one less restart than configured

Reply via email to