[PR] [FLINK-39890] Fix NaN in observed scaling coefficient when baseline rate is zero [flink-kubernetes-operator]

via GitHub Tue, 09 Jun 2026 02:31:19 -0700


lrsb opened a new pull request, #1134:
URL: https://github.com/apache/flink-kubernetes-operator/pull/1134


   ## What is the purpose of the change
   
   When `job.autoscaler.observed-scalability.enabled` is `true`, the autoscaler 
throws a `NumberFormatException` while computing the observed scaling 
coefficient for any vertex with a zero true-processing-rate (idle / very low 
traffic). The exception aborts the entire scaling pass, so no parallelism 
overrides / resource requirements are applied and the job stays stuck at its 
deployed parallelism.
   
   The root cause is in `AutoScalerUtils.optimizeLinearScalingCoefficient`: the 
denominator is `squaredSum * baselineProcessingRate`, but only `squaredSum` is 
guarded against zero. An idle vertex reports a finite true processing rate of 
`0.0` (passing the caller's `isNaN` guard), giving `baselineProcessingRate == 
0.0` and `sum == 0.0`. The result is `alpha = 0.0 / 0.0 = NaN`, which 
`Math.max`/`Math.min` do not sanitize, so it reaches `BigDecimal.valueOf(NaN)` 
and fails.
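   
   The failure chain described above can be reproduced in isolation. The 
following is a minimal sketch, not the operator's code: the class name, the 
`clamp` and `failsBigDecimalConversion` helpers, and the bounds `0.5` / `1.0` 
are hypothetical and chosen only for illustration.

```java
import java.math.BigDecimal;

public class NaNPropagationDemo {

    // Hypothetical helper: the same Math.max/Math.min clamp shape the PR text describes.
    static double clamp(double alpha, double lowerBound, double upperBound) {
        return Math.max(lowerBound, Math.min(upperBound, alpha));
    }

    // Hypothetical helper: true if BigDecimal.valueOf rejects the value.
    static boolean failsBigDecimalConversion(double value) {
        try {
            BigDecimal.valueOf(value);
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        double sum = 0.0;                    // idle vertex: every observed rate is 0.0
        double squaredSum = 5.0;             // illustrative: parallelism 1 and 2 -> 1*1 + 2*2
        double baselineProcessingRate = 0.0; // finite, so an isNaN guard lets it through

        double alpha = sum / (squaredSum * baselineProcessingRate); // 0.0 / 0.0
        double clamped = clamp(alpha, 0.5, 1.0);                    // illustrative bounds

        System.out.println(Double.isNaN(alpha));                 // true
        System.out.println(Double.isNaN(clamped));               // true: the clamp keeps NaN
        System.out.println(failsBigDecimalConversion(clamped));  // true
    }
}
```

   `Math.max` and `Math.min` return `NaN` whenever either argument is `NaN`, 
which is why the clamp cannot act as a sanitizer here.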
   
   This change guards the full denominator so a zero baseline (or zero 
`squaredSum`) falls back to linear scaling.
   
   ## Brief change log
   
     - Guard the complete denominator `squaredSum * baselineProcessingRate` in 
`AutoScalerUtils.optimizeLinearScalingCoefficient`, returning the 
linear-scaling fallback (`1.0`) when it is zero
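   
   A rough sketch of what such a guard might look like is shown below. The 
method name and the `squaredSum * baselineProcessingRate` denominator come from 
the description above; the enclosing class name, the parameter list, and the 
loop that accumulates `sum` and `squaredSum` are assumptions made for 
illustration and may differ from the actual `AutoScalerUtils` code.

```java
import java.util.List;

public class GuardedCoefficientSketch {

    // Assumed signature; only the method name and the denominator are taken from the PR text.
    static double optimizeLinearScalingCoefficient(
            List<Double> parallelismLevels,
            List<Double> processingRates,
            double baselineProcessingRate,
            double upperBound,
            double lowerBound) {

        double sum = 0.0;
        double squaredSum = 0.0;
        for (int i = 0; i < parallelismLevels.size(); i++) {
            double x = parallelismLevels.get(i);
            double y = processingRates.get(i);
            sum += x * y;
            squaredSum += x * x;
        }

        // Guard the complete denominator, not just squaredSum.
        double denominator = squaredSum * baselineProcessingRate;
        if (denominator == 0.0) {
            return 1.0; // linear-scaling fallback
        }

        double alpha = sum / denominator;
        return Math.max(lowerBound, Math.min(upperBound, alpha));
    }

    public static void main(String[] args) {
        // Idle vertex: all rates and the baseline are 0.0, so the fallback applies.
        double idle = optimizeLinearScalingCoefficient(
                List.of(1.0, 2.0), List.of(0.0, 0.0), 0.0, 1.0, 0.5);
        System.out.println(idle); // 1.0
    }
}
```

   Checking the product rather than each factor separately keeps the guard in 
one place and covers both the zero-baseline and the zero-`squaredSum` case.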
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
     - Added 
`testCalculateScalingCoefficientWithZeroProcessingRateFallsBackToLinear` to 
`JobVertexScalerTest`, which builds a scaling history of vertices with a zero 
true-processing-rate and asserts `calculateObservedScalingCoefficient` returns 
the linear-scaling coefficient `1.0` instead of producing `NaN`
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., are there any changes to the 
`CustomResourceDescriptors`: no
     - Core observer or reconciler logic that is regularly executed: yes (the 
observed scaling coefficient is computed on every scaling pass)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
