Dennis-Mircea Ciupitu created FLINK-39826:
---------------------------------------------

             Summary: Strengthen autoscaler configuration validation
                 Key: FLINK-39826
                 URL: https://issues.apache.org/jira/browse/FLINK-39826
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.15.0
            Reporter: Dennis-Mircea Ciupitu
             Fix For: kubernetes-operator-1.16.0


h1. Summary

Several autoscaler configuration options are not validated, so invalid values 
are accepted silently and surface only as confusing runtime behavior or, in one 
case, as autoscaling that never runs. This issue tightens autoscaler 
configuration validation to reject these misconfigurations when the resource is 
submitted, instead of letting them degrade scaling silently.

h1. Background and Gaps

h2. Unbounded numeric options

The autoscaler validator currently bounds only a subset of numeric options 
(the utilization target, the utilization min and max, and the scale factors). 
Several other ratio-style options are left unchecked:

- {{job.autoscaler.memory.gc-pressure.threshold}}
- {{job.autoscaler.memory.heap-usage.threshold}}
- {{job.autoscaler.scaling.effectiveness.threshold}}
- {{job.autoscaler.memory.tuning.overhead}}

These are all fractions that are meaningful only within the [0, 1] range, yet 
out-of-range values are accepted today. For example, a scaling effectiveness 
threshold above 1 silently blocks all scale-ups, and a negative memory tuning 
overhead can drive the tuned memory below the observed usage.
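
A minimal sketch of the kind of range check this implies. The option keys are the ones listed above; the class, the method name, the plain {{Map}} standing in for the Flink configuration, and the {{Optional<String>}} error convention are assumptions for illustration and not the operator's actual validator API.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class RatioOptionValidation {

    // Keys taken from the issue text; everything else here is hypothetical.
    static final List<String> RATIO_KEYS = List.of(
            "job.autoscaler.memory.gc-pressure.threshold",
            "job.autoscaler.memory.heap-usage.threshold",
            "job.autoscaler.scaling.effectiveness.threshold",
            "job.autoscaler.memory.tuning.overhead");

    /** Returns the first validation error found, or empty if all set values are in [0, 1]. */
    static Optional<String> validateRatios(Map<String, String> conf) {
        for (String key : RATIO_KEYS) {
            String raw = conf.get(key);
            if (raw == null) {
                continue; // option not set, so the default applies
            }
            double value;
            try {
                value = Double.parseDouble(raw.trim());
            } catch (NumberFormatException e) {
                return Optional.of(key + " is not a number: " + raw);
            }
            // The negated form also rejects NaN, which fails every comparison.
            if (!(value >= 0.0 && value <= 1.0)) {
                return Optional.of(key + " must be within [0, 1] but was " + raw);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> conf =
                Map.of("job.autoscaler.scaling.effectiveness.threshold", "1.5");
        System.out.println(validateRatios(conf).orElse("valid"));
    }
}
```

With a check like this, an effectiveness threshold of 1.5 or a tuning overhead of -0.1 would be reported at submission time instead of being accepted.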

In addition, the observed scalability coefficient minimum is validated 
unconditionally, even though it takes effect only when observed scalability is 
enabled. Options that matter only behind a feature flag should be validated 
only when that feature is on; otherwise, a harmless value can be rejected.
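
The gating could look roughly like the sketch below. Both key names and the [0, 1] bound used for the coefficient minimum are assumptions made for this example (the issue text names neither), and the {{Map}}-based configuration and {{Optional<String>}} result are stand-ins for whatever the real validator uses.

```java
import java.util.Map;
import java.util.Optional;

final class GatedOptionValidation {

    // Assumed key names; the issue does not spell them out.
    static final String ENABLED_KEY = "job.autoscaler.observed-scalability.enabled";
    static final String COEFFICIENT_MIN_KEY =
            "job.autoscaler.observed-scalability.coefficient-min";

    /** Validates the coefficient minimum only when the feature flag is on. */
    static Optional<String> validateObservedScalability(Map<String, String> conf) {
        boolean enabled = Boolean.parseBoolean(conf.getOrDefault(ENABLED_KEY, "false"));
        if (!enabled) {
            // Feature is off: the coefficient has no effect, so any value is tolerated.
            return Optional.empty();
        }
        String raw = conf.get(COEFFICIENT_MIN_KEY);
        if (raw == null) {
            return Optional.empty(); // not set, so the default applies
        }
        double value;
        try {
            value = Double.parseDouble(raw.trim());
        } catch (NumberFormatException e) {
            return Optional.of(COEFFICIENT_MIN_KEY + " is not a number: " + raw);
        }
        // Assumed bound for illustration; the negated form also rejects NaN.
        if (!(value >= 0.0 && value <= 1.0)) {
            return Optional.of(COEFFICIENT_MIN_KEY + " must be within [0, 1] but was " + raw);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> disabled = Map.of(COEFFICIENT_MIN_KEY, "5");
        Map<String, String> enabled = Map.of(ENABLED_KEY, "true", COEFFICIENT_MIN_KEY, "5");
        System.out.println(validateObservedScalability(disabled).orElse("valid"));
        System.out.println(validateObservedScalability(enabled).orElse("valid"));
    }
}
```

The same out-of-range value passes while the feature is off and is rejected once the flag is set, which is the behavior described above.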

h2. Metric window smaller than the reconcile interval

The autoscaler collects one metric sample per reconcile loop and requires at 
least two samples within the metric window before it evaluates scaling. If the 
metric window is configured smaller than the operator reconcile interval, the 
window is trimmed down to a single sample on every loop, the two-sample 
requirement is never met, and autoscaling is never applied. Nothing validates 
this relationship today, so the autoscaler appears enabled while silently doing 
nothing.
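
A sketch of the missing cross-option check, assuming both values are already available as {{java.time.Duration}} instances. The class and method names are hypothetical, and how the operator actually resolves the metric window and the reconcile interval from its configuration is not shown here.

```java
import java.time.Duration;
import java.util.Optional;

final class MetricWindowValidation {

    /**
     * Rejects a metric window that is smaller than the reconcile interval,
     * since such a window can never hold the two samples scaling requires.
     */
    static Optional<String> validateMetricWindow(
            Duration metricWindow, Duration reconcileInterval) {
        if (metricWindow.compareTo(reconcileInterval) < 0) {
            return Optional.of(String.format(
                    "Metric window (%s) must not be smaller than the reconcile interval (%s); "
                            + "otherwise autoscaling is never evaluated.",
                    metricWindow, reconcileInterval));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String> error =
                validateMetricWindow(Duration.ofSeconds(30), Duration.ofMinutes(1));
        System.out.println(error.orElse("valid"));
    }
}
```

Following the wording of the issue, the sketch rejects only a window strictly smaller than the interval. Whether a window equal to the interval should also be rejected is left open.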

h1. Goal

Validate the above at resource submission time so misconfigurations are 
reported as clear errors instead of silently degrading or disabling 
autoscaling. Feature-gated options are validated only when their feature is 
enabled, to avoid rejecting values that have no effect.