gaborgsomogyi opened a new pull request, #28570:
URL: https://github.com/apache/flink/pull/28570

   ## What is the purpose of the change
   
   The delegation token renewal manager used a fixed 1-hour retry backoff after 
any failure, a value chosen for Kerberos tickets with multi-hour lifetimes. 
Modern cloud credential providers (e.g. AWS STS) issue tokens with TTLs as 
short as 15 minutes, so a 1-hour pause after a single failure is unacceptable: 
the token would expire long before the next retry attempt.
   
   ## Brief change log
   
   - Replace the single fixed 
`security.delegation.tokens.renewal.retry.backoff` option with two new options: 
`security.delegation.tokens.renewal.retry.initial.backoff` (default 10 s) and 
`security.delegation.tokens.renewal.retry.max.backoff` (default 5 min); the old 
key is kept as a deprecated alias for backward compatibility.
   - Implement exponential backoff in `DefaultDelegationTokenManager`: the 
retry delay starts at the initial backoff and doubles on each consecutive 
failure up to the configured maximum.
   - Add ±50% jitter to retry delays to prevent thundering herds when many 
instances fail simultaneously.
   - Add TTL-relative bounding: when the last known token expiry is available, 
cap each retry delay to one third of the remaining valid window so that a retry 
always happens before the token expires, regardless of how late in the token's 
lifetime the failure occurred.
   - Extract the retry delay calculation into a dedicated 
`calculateRetryDelay(Clock)` method, consistent with the existing 
`calculateRenewalDelay(Clock, long)` pattern.
   - Reset the exponential backoff state on each successful token renewal.
   - Use `TimeUtils.formatWithHighestUnit` in log messages instead of raw 
millisecond values.
   - Remove deprecated Kerberos config key aliases 
(`security.kerberos.tokens.renewal.retry.backoff`, 
`security.kerberos.tokens.renewal.time-ratio`, 
`security.kerberos.fetch.delegation-token`) that have been eligible for removal 
since Flink 2.0 per the `@Public` API deprecation policy.
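
   The retry-delay rules listed above (exponential doubling, jitter, TTL-relative bound, reset on success) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `RetryBackoffSketch`, its fields, and the `onSuccess` method are hypothetical, and the method takes plain millisecond timestamps instead of a `Clock`. It also assumes that "±50% jitter" means a uniform factor in [0.5, 1.5), that jitter is applied before the TTL bound, and that a negative expiry means "unknown"; the actual implementation may differ on each point.

```java
import java.time.Duration;
import java.util.function.DoubleSupplier;

/** Hypothetical sketch of the retry-delay rules; not the DefaultDelegationTokenManager code. */
class RetryBackoffSketch {
    private final long initialBackoffMs;
    private final long maxBackoffMs;
    // Source of randomness returning values in [0, 1); injectable so tests are deterministic.
    private final DoubleSupplier random;
    // Base delay to use for the next failure.
    private long currentBackoffMs;

    RetryBackoffSketch(Duration initialBackoff, Duration maxBackoff, DoubleSupplier random) {
        this.initialBackoffMs = initialBackoff.toMillis();
        this.maxBackoffMs = maxBackoff.toMillis();
        this.random = random;
        this.currentBackoffMs = initialBackoffMs;
    }

    /**
     * Returns the delay in milliseconds before the next retry.
     *
     * @param nowMs current time in epoch milliseconds
     * @param tokenExpiryMs last known token expiry in epoch milliseconds, or negative if unknown
     */
    long calculateRetryDelay(long nowMs, long tokenExpiryMs) {
        long base = currentBackoffMs;
        // Double the base for the next consecutive failure, capped at the maximum.
        currentBackoffMs = Math.min(maxBackoffMs, currentBackoffMs * 2);

        // Assumed jitter model: scale the base by a uniform factor in [0.5, 1.5).
        double factor = 0.5 + random.getAsDouble();
        long delay = (long) (base * factor);

        // TTL-relative bound: never wait longer than one third of the remaining validity.
        if (tokenExpiryMs > nowMs) {
            long ttlBound = (tokenExpiryMs - nowMs) / 3;
            delay = Math.min(delay, ttlBound);
        }
        return delay;
    }

    /** Resets the backoff state after a successful renewal. */
    void onSuccess() {
        currentBackoffMs = initialBackoffMs;
    }
}
```

   With the defaults from this change (10 s initial, 5 min maximum) and jitter disabled, the base delay in this sketch grows 10 s, 20 s, 40 s, 80 s, 160 s and then stays at 300 s. If a failure happens with 9 s of token validity left, the bound reduces the delay to 3 s.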
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
   - `calculateRetryDelayShouldDoubleOnConsecutiveFailures` verifies that the 
retry delay doubles on each consecutive failure and that `currentRetryBackoff` 
advances correctly up to the configured maximum.
   - `calculateRetryDelayShouldResetAfterSuccess` verifies that the exponential 
backoff state resets to the initial value after a successful token renewal.
   - `calculateRetryDelayShouldCapToTtlBound` verifies that when a failure 
occurs close to token expiry, the retry delay is capped so the retry happens 
while the token is still valid.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: yes (`SecurityOptions` is `@PublicEvolving`; two new 
config options are added and one is deprecated)
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [x] Yes (Claude Code)
   

