[
https://issues.apache.org/jira/browse/FLINK-39958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gyula Fora closed FLINK-39958.
------------------------------
Fix Version/s: kubernetes-operator-1.16.0
Assignee: Dennis-Mircea Ciupitu
Resolution: Fixed
Merged to main in commit d971aa3ef0dec19cd98361162a252e1ede575ab1.
> Autoscaler Flink REST client timeout is silently overridden by the operator
> client timeout
> ------------------------------------------------------------------------------------------
>
> Key: FLINK-39958
> URL: https://issues.apache.org/jira/browse/FLINK-39958
> Project: Flink
> Issue Type: Bug
> Components: Autoscaler, Kubernetes Operator
> Affects Versions: kubernetes-operator-1.15.0
> Reporter: Dennis-Mircea Ciupitu
> Assignee: Dennis-Mircea Ciupitu
> Priority: Major
> Labels: pull-request-available
> Fix For: kubernetes-operator-1.16.0
>
>
> h1. Problem
> The autoscaler option {{AutoScalerOptions.FLINK_CLIENT_TIMEOUT}} (key
> {{job.autoscaler.flink.rest-client.timeout}}, fallback key
> {{kubernetes.operator.flink.rest-client.timeout}}, default {{10s}}) is
> advertised and documented, but it has no effect when the autoscaler runs
> inside the Kubernetes operator. Any value a user explicitly sets for it is
> silently discarded.
> h1. Root cause
> When the operator constructs the autoscaler context, it first ingests the
> resource's effective deploy configuration, which already includes any
> user-provided {{job.autoscaler.flink.rest-client.timeout}} from
> {{spec.flinkConfiguration}}, and then unconditionally overwrites
> {{AutoScalerOptions.FLINK_CLIENT_TIMEOUT}} with the operator-level
> {{OPERATOR_FLINK_CLIENT_TIMEOUT}}
> ({{kubernetes.operator.flink.client.timeout}}). Because this is an
> unconditional override rather than a default, the user's explicit autoscaler
> value is always clobbered.
> h1. Impact
> * The autoscaler REST-client timeout option is effectively a no-op in
> operator mode. A user who follows the autoscaler documentation and sets
> {{job.autoscaler.flink.rest-client.timeout}} sees no effect.
> * The behavior is inconsistent across deployment modes: the same option works
> in the standalone autoscaler (which has no operator config to override it)
> but is dead inside the operator.
> * Severity is low, because both options default to {{10s}}, so the override
> is invisible unless a user explicitly tunes the autoscaler option.
> h1. Expected behavior
> The operator's client timeout should continue to act as the default for the
> autoscaler, so that the autoscaler does not time out earlier or later than
> the rest of the operator's Flink REST interactions. However, an explicitly
> configured {{job.autoscaler.flink.rest-client.timeout}} must be honored
> instead of being silently overwritten. In other words, the operator timeout
> should be applied as a default/fallback, not as an unconditional override.
> h1. Notes
> * Backward compatible: with both options at the default {{10s}} nothing
> changes. Only users who explicitly tune the autoscaler option are affected,
> and for them the value now takes effect as documented.
> * This is a behavioral fix in core reconcile / autoscaler wiring, so it
> warrants a JIRA rather than a hotfix.
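
For readers skimming the archive, the difference between the "Root cause" and "Expected behavior" sections above can be sketched in a few lines. This is an illustration only, not the operator's actual code: plain maps stand in for Flink's Configuration, the class and method names are hypothetical, and "explicitly set" is assumed to mean that the key is present in the resource's effective deploy configuration. The autoscaler's own fallback key is left out for brevity.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only: plain maps stand in for Flink's Configuration. */
final class TimeoutWiringSketch {
    // Key names and the default value are quoted from the issue description.
    static final String AUTOSCALER_KEY = "job.autoscaler.flink.rest-client.timeout";
    static final String OPERATOR_KEY = "kubernetes.operator.flink.client.timeout";
    static final String DEFAULT_TIMEOUT = "10s";

    /** "Root cause": the operator-level value always replaces the autoscaler value. */
    static Map<String, String> unconditionalOverride(
            Map<String, String> deployConf, Map<String, String> operatorConf) {
        Map<String, String> ctx = new HashMap<>(deployConf);
        ctx.put(AUTOSCALER_KEY, operatorConf.getOrDefault(OPERATOR_KEY, DEFAULT_TIMEOUT));
        return ctx;
    }

    /** "Expected behavior": the operator-level value is used only when the key is absent. */
    static Map<String, String> operatorAsDefault(
            Map<String, String> deployConf, Map<String, String> operatorConf) {
        Map<String, String> ctx = new HashMap<>(deployConf);
        ctx.putIfAbsent(AUTOSCALER_KEY, operatorConf.getOrDefault(OPERATOR_KEY, DEFAULT_TIMEOUT));
        return ctx;
    }

    public static void main(String[] args) {
        // Hypothetical user setting of 30s in spec.flinkConfiguration.
        Map<String, String> deployConf = Map.of(AUTOSCALER_KEY, "30s");
        Map<String, String> operatorConf = Map.of(OPERATOR_KEY, "10s");

        // The user's 30s is clobbered by the operator's 10s.
        System.out.println(unconditionalOverride(deployConf, operatorConf).get(AUTOSCALER_KEY));
        // The user's 30s is honored.
        System.out.println(operatorAsDefault(deployConf, operatorConf).get(AUTOSCALER_KEY));
    }
}
```

With the hypothetical values in main, the first call yields 10s and the second yields 30s. When the user sets nothing, both variants fall back to the operator timeout, which matches the backward-compatibility note in the description.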
--
This message was sent by Atlassian Jira
(v8.20.10#820010)