lrsb opened a new pull request, #1134:
URL: https://github.com/apache/flink-kubernetes-operator/pull/1134
## What is the purpose of the change
When `job.autoscaler.observed-scalability.enabled` is `true`, the autoscaler
throws a `NumberFormatException` while computing the observed scaling
coefficient for any vertex with a zero true-processing-rate (idle / very low
traffic). The exception aborts the entire scaling pass, so no parallelism
overrides / resource requirements are applied and the job stays stuck at its
deployed parallelism.
The root cause is in `AutoScalerUtils.optimizeLinearScalingCoefficient`: the
denominator is `squaredSum * baselineProcessingRate`, but only `squaredSum` is
guarded against zero. An idle vertex reports a finite true processing rate of
`0.0` (passing the caller's `isNaN` guard), giving `baselineProcessingRate ==
0.0` and `sum == 0.0`. The result is `alpha = 0.0 / 0.0 = NaN`, which
`Math.max`/`Math.min` do not sanitize, so it reaches `BigDecimal.valueOf(NaN)`
and fails.
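
The propagation described above can be reproduced in isolation. The following is a minimal sketch, not the operator's code: the class name, the `clamp` helper, and the bounds `0.5` and `1.0` are illustrative assumptions. It relies only on documented JDK behaviour: `Math.max`/`Math.min` return `NaN` when either argument is `NaN`, and `BigDecimal.valueOf(double)` throws `NumberFormatException` for `NaN`.

```java
import java.math.BigDecimal;

public class NaNPropagationDemo {

    /** Clamps with Math.max/Math.min; the bounds passed in are illustrative only. */
    static double clamp(double value, double lowerBound, double upperBound) {
        return Math.max(lowerBound, Math.min(upperBound, value));
    }

    /** Returns true if BigDecimal.valueOf rejects the given double. */
    static boolean bigDecimalRejects(double value) {
        try {
            BigDecimal.valueOf(value);
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Idle vertex: both the numerator and the denominator are 0.0.
        double alpha = 0.0 / 0.0;

        // Clamping does not sanitize NaN; it passes straight through.
        double clamped = clamp(alpha, 0.5, 1.0);

        System.out.println("alpha is NaN: " + Double.isNaN(alpha));
        System.out.println("clamped is NaN: " + Double.isNaN(clamped));
        System.out.println("BigDecimal rejects it: " + bigDecimalRejects(clamped));
    }
}
```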
This change guards the full denominator so a zero baseline (or zero
`squaredSum`) falls back to linear scaling.
## Brief change log
- Guard the complete denominator `squaredSum * baselineProcessingRate` in
`AutoScalerUtils.optimizeLinearScalingCoefficient`, returning the
linear-scaling fallback (`1.0`) when it is zero
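
For illustration, the guard might look roughly like the sketch below. This is a simplified stand-in rather than the actual patch: the class name, the array-based signature, and the way `sum` and `squaredSum` are accumulated (a least-squares fit through the origin) are assumptions. Only the denominator `squaredSum * baselineProcessingRate` and the `1.0` fallback come from the description above.

```java
public class GuardedCoefficientSketch {

    /** Coefficient meaning "assume linear scaling", as described in the change log. */
    static final double LINEAR_SCALING_COEFFICIENT = 1.0;

    /**
     * Hypothetical simplified version of the coefficient computation.
     * The accumulation of sum and squaredSum is an assumption.
     */
    static double optimizeLinearScalingCoefficient(
            double[] parallelisms,
            double[] processingRates,
            double baselineProcessingRate,
            double lowerBound,
            double upperBound) {
        double sum = 0.0;
        double squaredSum = 0.0;
        for (int i = 0; i < parallelisms.length; i++) {
            sum += parallelisms[i] * processingRates[i];
            squaredSum += parallelisms[i] * parallelisms[i];
        }

        // Guard the complete denominator, not just squaredSum.
        double denominator = squaredSum * baselineProcessingRate;
        if (denominator == 0.0) {
            return LINEAR_SCALING_COEFFICIENT;
        }

        double alpha = sum / denominator;
        return Math.max(lowerBound, Math.min(upperBound, alpha));
    }

    public static void main(String[] args) {
        // Idle vertex: all rates and the baseline are 0.0, so the fallback applies.
        double idle = optimizeLinearScalingCoefficient(
                new double[] {2, 4}, new double[] {0.0, 0.0}, 0.0, 0.5, 1.0);
        System.out.println("idle coefficient = " + idle);
    }
}
```

Checking the product rather than each factor keeps the guard in one place and covers both the zero-baseline and the zero-`squaredSum` cases.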
## Verifying this change
This change added tests and can be verified as follows:
- Added
`testCalculateScalingCoefficientWithZeroProcessingRateFallsBackToLinear` to
`JobVertexScalerTest`, which builds a scaling history of vertices with a zero
true-processing-rate and asserts `calculateObservedScalingCoefficient` returns
the linear-scaling coefficient `1.0` instead of producing `NaN`
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., are there any changes to the `CustomResourceDescriptors`:
no
- Core observer or reconciler logic that is regularly executed: yes (the
observed scaling coefficient is computed on every scaling pass)
## Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? not applicable