Santwana Verma created FLINK-39967:
--------------------------------------
Summary: FlinkStateSnapshot with default backoffLimit=-1
(documented as "unlimited retries") never retries and fails immediately on
first error
Key: FLINK-39967
URL: https://issues.apache.org/jira/browse/FLINK-39967
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Reporter: Santwana Verma
`FlinkStateSnapshotSpec.backoffLimit` defaults to -1, which is documented as
meaning unlimited retries:
```
/**
 * Maximum number of retries before the snapshot is considered as failed. Set to -1 for
 * unlimited or 0 for no retries.
 */
private int backoffLimit = -1;
```
However, the retry decision in FlinkStateSnapshotController does a plain
numeric comparison:
```
if (resource.getStatus().getFailures() > resource.getSpec().getBackoffLimit()) {
    // give up, .withNoRetry()
}
```
With the default backoffLimit = -1, after the very first failure getFailures()
is 1, so 1 > -1 evaluates to true and the snapshot is immediately marked as
failed with no retry. This is the exact opposite of the documented behavior.
More generally, neither the sentinel -1 nor any other negative backoffLimit is
handled specially, so the documented "unlimited" contract is never honored.
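To make the arithmetic concrete, here is a minimal standalone sketch of the comparison quoted above. The class and method names (`CurrentBackoffCheck`, `givesUp`) are hypothetical stand-ins for illustration and are not the operator's actual code; only the `failures > backoffLimit` comparison is taken from the report.

```java
// Hypothetical stand-in for the check quoted above; not the operator's real class.
final class CurrentBackoffCheck {

    /** Mirrors the quoted comparison: give up when failures > backoffLimit. */
    static boolean givesUp(int failures, int backoffLimit) {
        return failures > backoffLimit;
    }

    public static void main(String[] args) {
        // Evaluate the check right after the first failure for a few limits.
        int[] limits = {-1, 0, 3};
        for (int limit : limits) {
            System.out.println(
                    "backoffLimit=" + limit + ", failures=1 -> givesUp=" + givesUp(1, limit));
        }
    }
}
```

After one failure, this check gives up for both -1 and 0 and only retries for the positive limit, so the default value behaves the same as "no retries".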
Steps to reproduce:
1. Create a FlinkStateSnapshot (savepoint or checkpoint) without setting
backoffLimit (defaults to -1).
2. Cause the snapshot to fail once (e.g. unreachable JobManager / transient
error).
3. Observe the snapshot is marked failed with "won't be retried as failure
count exceeded the backoff limit" instead of retrying.
Expected behavior: With backoffLimit = -1, the snapshot should be retried
indefinitely (with the existing exponential backoff). backoffLimit = 0 should
mean no retries; backoffLimit = N should allow up to N retries.
Actual behavior: Snapshot fails immediately after the first error and is never
retried.
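One possible shape of a check that matches the expected behavior is sketched below. This is an illustration under stated assumptions, not a proposed patch: `BackoffPolicySketch` and `shouldGiveUp` are hypothetical names, and treating every negative value (not just -1) as unlimited is an assumption; rejecting other negative values during validation would be an equally plausible choice.

```java
// Hypothetical sketch of the expected semantics; names are illustrative only.
final class BackoffPolicySketch {

    /**
     * Returns true when the snapshot should be marked failed without further retries.
     *
     * @param failures     number of failures recorded so far
     * @param backoffLimit configured limit; negative is treated as unlimited (assumption)
     */
    static boolean shouldGiveUp(int failures, int backoffLimit) {
        if (backoffLimit < 0) {
            // -1 is documented as "unlimited retries", so never give up.
            return false;
        }
        // 0 means no retries; N allows up to N retries (N + 1 attempts in total).
        return failures > backoffLimit;
    }

    public static void main(String[] args) {
        System.out.println(shouldGiveUp(1, -1)); // unlimited: keep retrying
        System.out.println(shouldGiveUp(1, 0));  // no retries: give up after first failure
        System.out.println(shouldGiveUp(3, 3));  // third failure with limit 3: retry once more
        System.out.println(shouldGiveUp(4, 3));  // fourth failure with limit 3: give up
    }
}
```

The only change relative to the quoted comparison is the early return for a negative limit; the non-negative cases keep the existing `failures > backoffLimit` behavior.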