[jira] [Created] (FLINK-39967) FlinkStateSnapshot with default backoffLimit=-1 (documented as "unlimited retries") never retries and fails immediately on first error

Santwana Verma (Jira) Mon, 22 Jun 2026 05:29:11 -0700

Santwana Verma created FLINK-39967:
--------------------------------------

             Summary: FlinkStateSnapshot with default backoffLimit=-1 
(documented as "unlimited retries") never retries and fails immediately on 
first error
                 Key: FLINK-39967
                 URL: https://issues.apache.org/jira/browse/FLINK-39967
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
            Reporter: Santwana Verma



`FlinkStateSnapshotSpec.backoffLimit` defaults to -1, which is documented as 
meaning unlimited retries:
```
/**
 * Maximum number of retries before the snapshot is considered as failed. Set to -1 for
 * unlimited or 0 for no retries.
 */
private int backoffLimit = -1;
```
However, the retry decision in FlinkStateSnapshotController does a plain 
numeric comparison:
```
if (resource.getStatus().getFailures() > resource.getSpec().getBackoffLimit()) {
    // give up, .withNoRetry()
}
```
With the default backoffLimit = -1, getFailures() is 1 after the very first 
failure, so 1 > -1 evaluates to true and the snapshot is immediately marked as 
failed with no retry. This is the exact opposite of the documented behavior. 
More generally, negative values, including the sentinel -1, are not handled 
specially, so the "unlimited retries" contract is never honored.
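
To make the arithmetic concrete, here is a minimal standalone sketch of the 
comparison quoted above. The class and method names (`CurrentBackoffCheck`, 
`givesUp`) are hypothetical stand-ins for illustration only, not the operator's 
actual code; only the `failures > backoffLimit` comparison is taken from the 
report.

```java
// Hypothetical stand-in for the comparison quoted in this report;
// it is NOT the actual FlinkStateSnapshotController code.
final class CurrentBackoffCheck {

    private CurrentBackoffCheck() {}

    /**
     * Mirrors the quoted check: the snapshot is given up on as soon as
     * the failure count is strictly greater than backoffLimit.
     */
    static boolean givesUp(int failures, int backoffLimit) {
        return failures > backoffLimit;
    }
}
```

Under this check, `givesUp(1, -1)` is true, so the default value stops retrying 
after the first failure, while `givesUp(1, 1)` is false and `givesUp(2, 1)` is 
true, which matches the intended behavior for non-negative limits.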

Steps to reproduce:
  1. Create a FlinkStateSnapshot (savepoint or checkpoint) without setting 
backoffLimit (defaults to -1).
  2. Cause the snapshot to fail once (e.g. unreachable JobManager / transient 
error).
  3. Observe the snapshot is marked failed with "won't be retried as failure 
count exceeded the backoff limit" instead of retrying.

Expected behavior: With backoffLimit = -1, the snapshot should be retried 
indefinitely (with the existing exponential backoff). backoffLimit = 0 should 
mean no retries; backoffLimit = N should allow up to N retries.
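
One possible shape for a check that matches this expected behavior is sketched 
below. It is only an illustration under stated assumptions: `BackoffPolicy` and 
`shouldGiveUp` are hypothetical names, and treating every negative value (not 
just -1) as unlimited is an assumption that the documented contract does not 
spell out.

```java
// Hypothetical sketch of a sentinel-aware retry decision; names are
// illustrative and not part of the Flink Kubernetes Operator API.
final class BackoffPolicy {

    private BackoffPolicy() {}

    /**
     * Returns true when no further retry should be attempted.
     *
     * @param failures     number of failures recorded so far
     * @param backoffLimit -1 for unlimited retries, 0 for no retries,
     *                     N for up to N retries
     */
    static boolean shouldGiveUp(int failures, int backoffLimit) {
        // Assumption: any negative limit is treated like the -1 sentinel.
        if (backoffLimit < 0) {
            return false;
        }
        // The first failure is not a retry, so N retries means giving up
        // once failures exceeds N.
        return failures > backoffLimit;
    }
}
```

With this sketch, `shouldGiveUp(1, -1)` is false (keep retrying), 
`shouldGiveUp(1, 0)` is true (no retries), and `shouldGiveUp(3, 3)` is false 
while `shouldGiveUp(4, 3)` is true (up to three retries).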

Actual behavior: Snapshot fails immediately after the first error and is never 
retried.



