Dennis-Mircea Ciupitu created FLINK-39826:
---------------------------------------------
Summary: Strengthen autoscaler configuration validation
Key: FLINK-39826
URL: https://issues.apache.org/jira/browse/FLINK-39826
Project: Flink
Issue Type: Improvement
Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.15.0
Reporter: Dennis-Mircea Ciupitu
Fix For: kubernetes-operator-1.16.0
h1. Summary
Several autoscaler configuration options are not validated, so invalid values
are accepted silently and surface only as confusing runtime behavior or, in one
case, as autoscaling that never runs. This issue tightens autoscaler
configuration validation to reject these misconfigurations when the resource is
submitted, instead of letting them degrade scaling silently.
h1. Background and Gaps
h2. Unbounded numeric options
The autoscaler validator currently bounds only a subset of numeric options
(utilization target, min and max, scale factors). Several other ratio-style
options are left unchecked:
- {{job.autoscaler.memory.gc-pressure.threshold}}
- {{job.autoscaler.memory.heap-usage.threshold}}
- {{job.autoscaler.scaling.effectiveness.threshold}}
- {{job.autoscaler.memory.tuning.overhead}}
These are all fractions that are only meaningful within the [0, 1] range, yet
out-of-range values are accepted today. For example, a scaling effectiveness
threshold above 1 silently blocks all scale-ups, and a negative memory tuning
overhead can drive the tuned memory below the observed usage.
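A minimal sketch of the missing range check, assuming the options arrive as a plain string map. The class name {{FractionRangeCheck}} and its methods are hypothetical and are not the operator's actual validator API. Only the four option keys come from this issue.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration; not the operator's real validator class.
class FractionRangeCheck {

    // Option keys listed in this issue; each is a fraction expected in [0, 1].
    static final List<String> FRACTION_KEYS = List.of(
            "job.autoscaler.memory.gc-pressure.threshold",
            "job.autoscaler.memory.heap-usage.threshold",
            "job.autoscaler.scaling.effectiveness.threshold",
            "job.autoscaler.memory.tuning.overhead");

    /** Returns an error message if the option is set and lies outside [0, 1]. */
    static Optional<String> validateFraction(Map<String, String> conf, String key) {
        String raw = conf.get(key);
        if (raw == null) {
            return Optional.empty(); // unset: the default applies, nothing to check
        }
        double value;
        try {
            value = Double.parseDouble(raw);
        } catch (NumberFormatException e) {
            return Optional.of(key + " is not a number: " + raw);
        }
        // The negated form also rejects NaN, which fails every comparison.
        if (!(value >= 0.0 && value <= 1.0)) {
            return Optional.of(key + " must be within [0, 1] but was " + raw);
        }
        return Optional.empty();
    }

    /** Returns the first validation error across all fraction options, if any. */
    static Optional<String> validateAll(Map<String, String> conf) {
        for (String key : FRACTION_KEYS) {
            Optional<String> error = validateFraction(conf, key);
            if (error.isPresent()) {
                return error;
            }
        }
        return Optional.empty();
    }
}
```

Returning an {{Optional}} error message rather than throwing lets the caller collect the error and report it at submission time. That return style is a design choice of this sketch, not something this issue specifies.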
In addition, the observed scalability coefficient minimum is validated
unconditionally, even though it only takes effect when observed scalability is
enabled. Options that only matter behind a feature flag should be validated
only when that feature is on; otherwise a harmless value can be rejected.
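The feature-gated variant could look roughly like the sketch below. The two option keys, the [0, 1] bound, and the class name {{GatedOptionCheck}} are assumptions made for illustration. This issue names neither the exact keys nor the accepted range for the coefficient minimum.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration; not the operator's real validator class.
class GatedOptionCheck {

    // Assumed key names; check them against the autoscaler option definitions.
    static final String ENABLED_KEY = "job.autoscaler.observed-scalability.enabled";
    static final String COEFFICIENT_MIN_KEY =
            "job.autoscaler.observed-scalability.coefficient-min";

    /** Validates the coefficient minimum only when observed scalability is enabled. */
    static Optional<String> validateCoefficientMin(Map<String, String> conf) {
        boolean enabled = Boolean.parseBoolean(conf.getOrDefault(ENABLED_KEY, "false"));
        if (!enabled) {
            // Feature is off: the value has no effect, so it is not rejected.
            return Optional.empty();
        }
        String raw = conf.get(COEFFICIENT_MIN_KEY);
        if (raw == null) {
            return Optional.empty(); // unset: the default applies
        }
        double value;
        try {
            value = Double.parseDouble(raw);
        } catch (NumberFormatException e) {
            return Optional.of(COEFFICIENT_MIN_KEY + " is not a number: " + raw);
        }
        // Assumed illustrative bound of [0, 1]; the negated form also rejects NaN.
        if (!(value >= 0.0 && value <= 1.0)) {
            return Optional.of(COEFFICIENT_MIN_KEY + " must be within [0, 1] but was " + raw);
        }
        return Optional.empty();
    }
}
```

With the flag absent or false, an out-of-range coefficient passes. The same value is rejected once the flag is set to true.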
h2. Metric window smaller than the reconcile interval
The autoscaler collects one metric sample per reconcile loop and requires at
least two samples within the metric window before it evaluates scaling. If the
metric window is configured smaller than the operator reconcile interval, the
window is trimmed down to a single sample on every loop, the two-sample
requirement is never met, and autoscaling is never applied. Nothing validates
this relationship today, so the autoscaler appears enabled while silently doing
nothing.
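The relationship described above can be expressed as a simple cross-option check. This is a sketch under stated assumptions: the class and method names are hypothetical, and both durations are assumed to be already parsed from the metric window and reconcile interval options. The actual fix may choose a stricter bound than the one shown here.

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical helper for illustration; not the operator's real validator class.
class MetricWindowCheck {

    /**
     * Rejects a metric window smaller than the reconcile interval. With one
     * sample collected per reconcile loop, such a window never holds the two
     * samples the autoscaler needs before it evaluates scaling.
     */
    static Optional<String> validateMetricWindow(
            Duration metricWindow, Duration reconcileInterval) {
        if (metricWindow.compareTo(reconcileInterval) < 0) {
            return Optional.of(String.format(
                    "Metric window (%s) must not be smaller than the reconcile "
                            + "interval (%s); autoscaling would never be applied.",
                    metricWindow, reconcileInterval));
        }
        return Optional.empty();
    }
}
```

For example, a 30 second window with a 60 second reconcile interval is rejected, while a 15 minute window with the same interval passes.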
h1. Goal
Validate the above at resource submission time so misconfigurations are
reported as clear errors instead of silently degrading or disabling
autoscaling. Feature-gated options are validated only when their feature is
enabled, to avoid rejecting values that have no effect.