[PR] [FLINK-39743] [flink-autoscaler] Fix incorrect expected processing rate computation [flink-kubernetes-operator]

via GitHub Tue, 02 Jun 2026 00:11:13 -0700


devu1997 opened a new pull request, #1127:
URL: https://github.com/apache/flink-kubernetes-operator/pull/1127


   ## What is the purpose of the change
   
   The Flink Autoscaler computes the expected processing rate 
(EXPECTED_PROCESSING_RATE) using cappedTargetCapacity, which is derived from 
the initial scaling factor proposed by the algorithm after bounding it within 
MAX_SCALE_DOWN_FACTOR and MAX_SCALE_UP_FACTOR.
   
   
https://github.com/apache/flink-kubernetes-operator/blame/main/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobVertexScaler.java#L255
   
   However, the recommended parallelism can later be modified inside the scale 
method due to additional constraints such as max parallelism, vertex-level 
min/max limits, and input partition counts. As a result, the final applied 
caling factor may differ from the factor originally used to compute 
cappedTargetCapacity. This mismatch can cause effective scaling actions to be 
incorrectly classified as ineffective.
   
   This pull request is to compute the expected processing rate after the scale 
method finalizes and adjusts newParallelism based on constraints such as max 
parallelism, vertex-level min/max limits, input partition counts, etc.
   
   ## Brief change log
   
     - *Compute the expected processing rate after the scale method finalizes 
and adjusts newParallelism based on constraints such as max parallelism, 
vertex-level min/max limits, input partition counts, etc.*
   
   ## Verifying this change
   <!--
   Please make sure both new and modified tests in this PR follows the 
conventions defined in our code quality guide: 
https://flink.apache.org/contributing/code-style-and-quality-common.html#testing
   -->
   *(Please pick either of the following options)*
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This change added tests and can be verified as follows:
   
   *(example:)*
     - *Added integration tests for end-to-end deployment with large payloads 
(100MB)*
     - *Extended integration test for recovery after master (JobManager) 
failure*
     - *Manually verified the change by running a 4 node cluster with 2 
JobManagers and 4 TaskManagers, a stateful streaming program, and killing one 
JobManager and two TaskManagers during the execution, verifying that recovery 
happens correctly.*
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changes to the `CustomResourceDescriptors`: 
no
     - Core observer or reconciler logic that is regularly executed: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] [FLINK-39743] [flink-autoscaler] Fix incorrect expected processing rate computation [flink-kubernetes-operator]

Reply via email to