Dennis-Mircea Ciupitu created FLINK-39958:
---------------------------------------------

             Summary: Autoscaler Flink REST client timeout is silently overridden by the operator client timeout
                 Key: FLINK-39958
                 URL: https://issues.apache.org/jira/browse/FLINK-39958
             Project: Flink
          Issue Type: Bug
          Components: Autoscaler, Kubernetes Operator
    Affects Versions: kubernetes-operator-1.15.0
            Reporter: Dennis-Mircea Ciupitu


h1. Problem

The autoscaler option {{AutoScalerOptions.FLINK_CLIENT_TIMEOUT}} (key 
{{job.autoscaler.flink.rest-client.timeout}}, fallback key 
{{kubernetes.operator.flink.rest-client.timeout}}, default {{10s}}) is 
advertised and documented, but it has no effect when the autoscaler runs inside 
the Kubernetes operator. Any value a user explicitly sets for it is silently 
discarded.

h1. Root cause

When the operator constructs the autoscaler context, it first ingests the 
resource's effective deploy configuration, which already includes any 
user-provided {{job.autoscaler.flink.rest-client.timeout}} from 
{{spec.flinkConfiguration}}, and then unconditionally overwrites 
{{AutoScalerOptions.FLINK_CLIENT_TIMEOUT}} with the operator-level 
{{OPERATOR_FLINK_CLIENT_TIMEOUT}} 
({{kubernetes.operator.flink.client.timeout}}). Because this is an 
unconditional override rather than a default, the user's explicit autoscaler 
value is always clobbered.
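
The override pattern described above can be illustrated with a minimal, 
self-contained sketch. It is not the operator's actual code: the class 
{{UnconditionalOverrideSketch}}, the method {{autoscalerTimeout}} and the use 
of a plain {{Map}} in place of Flink's configuration object are assumptions 
made for illustration, and fallback-key resolution is left out. Only the two 
option keys are taken from this report.

```java
import java.util.HashMap;
import java.util.Map;

public class UnconditionalOverrideSketch {
    // Option keys as named in this report.
    static final String AUTOSCALER_KEY = "job.autoscaler.flink.rest-client.timeout";
    static final String OPERATOR_KEY = "kubernetes.operator.flink.client.timeout";

    // Hypothetical stand-in for the operator's autoscaler context construction.
    static String autoscalerTimeout(Map<String, String> deployConfig, String operatorTimeout) {
        // Step 1: ingest the effective deploy configuration, including any
        // user-provided autoscaler timeout from spec.flinkConfiguration.
        Map<String, String> conf = new HashMap<>(deployConfig);
        // Step 2: write the operator-level timeout over the autoscaler key,
        // regardless of whether the user already set it.
        conf.put(AUTOSCALER_KEY, operatorTimeout);
        return conf.get(AUTOSCALER_KEY);
    }

    public static void main(String[] args) {
        Map<String, String> deployConfig = new HashMap<>();
        deployConfig.put(AUTOSCALER_KEY, "30s"); // explicit user value
        System.out.println(autoscalerTimeout(deployConfig, "10s")); // prints 10s
    }
}
```

In this sketch the user's {{30s}} never reaches the autoscaler, because the 
second step replaces it with the operator value.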

h1. Impact

* The autoscaler REST-client timeout option is effectively a no-op in operator 
mode. A user who follows the autoscaler documentation and sets 
{{job.autoscaler.flink.rest-client.timeout}} sees no effect.
* The behavior is inconsistent across deployment modes: the same option works 
in the standalone autoscaler (which has no operator config to override it) but 
is dead inside the operator.
* Severity is low: both options default to {{10s}}, so the override is 
invisible unless a user explicitly tunes the autoscaler option.

h1. Expected behavior

The operator's client timeout should continue to act as the default for the 
autoscaler, so that the autoscaler does not time out earlier or later than the 
rest of the operator's Flink REST interactions. However, an explicitly 
configured {{job.autoscaler.flink.rest-client.timeout}} must be honored instead 
of being silently overwritten. In other words, the operator timeout should be 
applied as a default/fallback, not as an unconditional override.
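
A minimal sketch of the default/fallback semantics, under the same kind of 
simplification: a plain {{Map}} stands in for Flink's configuration object, 
fallback keys are ignored, and the names {{DefaultNotOverrideSketch}} and 
{{autoscalerTimeout}} are hypothetical. It shows one possible shape of the 
behavior, not the actual patch.

```java
import java.util.HashMap;
import java.util.Map;

public class DefaultNotOverrideSketch {
    // Autoscaler option key as named in this report.
    static final String AUTOSCALER_KEY = "job.autoscaler.flink.rest-client.timeout";

    // Hypothetical stand-in for the operator's autoscaler context construction.
    static String autoscalerTimeout(Map<String, String> deployConfig, String operatorTimeout) {
        Map<String, String> conf = new HashMap<>(deployConfig);
        // Apply the operator timeout only when the user has not set the
        // autoscaler option explicitly.
        conf.putIfAbsent(AUTOSCALER_KEY, operatorTimeout);
        return conf.get(AUTOSCALER_KEY);
    }

    public static void main(String[] args) {
        Map<String, String> explicit = new HashMap<>();
        explicit.put(AUTOSCALER_KEY, "30s");
        System.out.println(autoscalerTimeout(explicit, "10s")); // prints 30s
        System.out.println(autoscalerTimeout(new HashMap<>(), "20s")); // prints 20s
    }
}
```

With this shape, an explicit autoscaler value wins, and an unset autoscaler 
option still follows the operator's client timeout.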

h1. Notes

* Backward compatible: with both options at the default {{10s}} nothing 
changes. Only users who explicitly tune the autoscaler option are affected, and 
for them the value now takes effect as documented.
* This is a behavioral fix in the core reconciliation / autoscaler wiring, so 
it warrants a JIRA rather than a hotfix.




