Dennis-Mircea Ciupitu created FLINK-39958:
---------------------------------------------
Summary: Autoscaler Flink REST client timeout is silently
overridden by the operator client timeout
Key: FLINK-39958
URL: https://issues.apache.org/jira/browse/FLINK-39958
Project: Flink
Issue Type: Bug
Components: Autoscaler, Kubernetes Operator
Affects Versions: kubernetes-operator-1.15.0
Reporter: Dennis-Mircea Ciupitu
h1. Problem
The autoscaler option {{AutoScalerOptions.FLINK_CLIENT_TIMEOUT}} (key
{{job.autoscaler.flink.rest-client.timeout}}, fallback key
{{kubernetes.operator.flink.rest-client.timeout}}, default {{10s}}) is
advertised and documented, but it has no effect when the autoscaler runs inside
the Kubernetes operator. Any value a user explicitly sets for it is silently
discarded.
h1. Root cause
When the operator constructs the autoscaler context, it first ingests the
resource's effective deploy configuration, which already includes any
user-provided {{job.autoscaler.flink.rest-client.timeout}} from
{{spec.flinkConfiguration}}, and then unconditionally overwrites
{{AutoScalerOptions.FLINK_CLIENT_TIMEOUT}} with the operator-level
{{OPERATOR_FLINK_CLIENT_TIMEOUT}}
({{kubernetes.operator.flink.client.timeout}}). Because this is an
unconditional override rather than a default, the user's explicit autoscaler
value is always clobbered.
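The ordering problem can be illustrated with a minimal, self-contained sketch. It models the configuration as a plain map rather than Flink's configuration classes, and the names {{OverrideSketch}} and {{buildAutoscalerConf}} are hypothetical; this is not the operator's actual code. Only the two option keys are taken from this report.

```java
import java.util.HashMap;
import java.util.Map;

class OverrideSketch {
    // Option keys as quoted in this report.
    static final String AUTOSCALER_TIMEOUT = "job.autoscaler.flink.rest-client.timeout";
    static final String OPERATOR_TIMEOUT = "kubernetes.operator.flink.client.timeout";

    // Hypothetical stand-in for the context construction described above:
    // ingest the effective deploy configuration, then overwrite the autoscaler timeout.
    static Map<String, String> buildAutoscalerConf(
            Map<String, String> deployConf, String operatorTimeout) {
        Map<String, String> conf = new HashMap<>(deployConf);
        // Unconditional put: replaces whatever the user configured.
        conf.put(AUTOSCALER_TIMEOUT, operatorTimeout);
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> deployConf = new HashMap<>();
        // User explicitly tunes the autoscaler option in spec.flinkConfiguration.
        deployConf.put(AUTOSCALER_TIMEOUT, "30s");

        Map<String, String> conf = buildAutoscalerConf(deployConf, "10s");
        System.out.println(conf.get(AUTOSCALER_TIMEOUT)); // prints "10s": the user's 30s is lost
    }
}
```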
h1. Impact
* The autoscaler REST-client timeout option is effectively a no-op in operator
mode. A user who follows the autoscaler documentation and sets
{{job.autoscaler.flink.rest-client.timeout}} sees no effect.
* The behavior is inconsistent across deployment modes: the same option works
in the standalone autoscaler (which has no operator config to override it) but
is dead inside the operator.
* Severity is low: both options default to {{10s}}, so the override is
invisible unless a user explicitly tunes the autoscaler option.
h1. Expected behavior
The operator's client timeout should continue to act as the default for the
autoscaler, so that the autoscaler does not time out earlier or later than the
rest of the operator's Flink REST interactions. However, an explicitly
configured {{job.autoscaler.flink.rest-client.timeout}} must be honored instead
of being silently overwritten. In other words, the operator timeout should be
applied as a default/fallback, not as an unconditional override.
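One way to picture the default/fallback semantics is the following sketch. It again uses a plain map as a stand-in for the real configuration object, and {{FallbackSketch}} and {{buildAutoscalerConf}} are hypothetical names, not a proposed patch. As a simplification, it only checks the primary autoscaler key and ignores the option's fallback key.

```java
import java.util.HashMap;
import java.util.Map;

class FallbackSketch {
    // Option key as quoted in this report.
    static final String AUTOSCALER_TIMEOUT = "job.autoscaler.flink.rest-client.timeout";

    // Hypothetical context construction: the operator timeout is applied
    // only when the user has not set the autoscaler option explicitly.
    static Map<String, String> buildAutoscalerConf(
            Map<String, String> deployConf, String operatorTimeout) {
        Map<String, String> conf = new HashMap<>(deployConf);
        // putIfAbsent keeps an existing user value and fills in the default otherwise.
        conf.putIfAbsent(AUTOSCALER_TIMEOUT, operatorTimeout);
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> tuned = new HashMap<>();
        tuned.put(AUTOSCALER_TIMEOUT, "30s");
        Map<String, String> untuned = new HashMap<>();

        // Explicit user value is honored.
        System.out.println(buildAutoscalerConf(tuned, "10s").get(AUTOSCALER_TIMEOUT)); // prints "30s"
        // Without a user value, the operator timeout acts as the default.
        System.out.println(buildAutoscalerConf(untuned, "10s").get(AUTOSCALER_TIMEOUT)); // prints "10s"
    }
}
```

With this shape, users who never set the autoscaler option see the same timeout as before, which matches the backward-compatibility expectation.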
h1. Notes
* Backward compatible: with both options at the default {{10s}} nothing
changes. Only users who explicitly tune the autoscaler option are affected, and
for them the value now takes effect as documented.
* This is a behavioral fix in core reconcile / autoscaler wiring, so it
warrants a JIRA rather than a hotfix.