devu1997 opened a new pull request, #1127: URL: https://github.com/apache/flink-kubernetes-operator/pull/1127
## What is the purpose of the change The Flink Autoscaler computes the expected processing rate (EXPECTED_PROCESSING_RATE) using cappedTargetCapacity, which is derived from the initial scaling factor proposed by the algorithm after bounding it within MAX_SCALE_DOWN_FACTOR and MAX_SCALE_UP_FACTOR. https://github.com/apache/flink-kubernetes-operator/blame/main/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobVertexScaler.java#L255 However, the recommended parallelism can later be modified inside the scale method due to additional constraints such as max parallelism, vertex-level min/max limits, and input partition counts. As a result, the final applied caling factor may differ from the factor originally used to compute cappedTargetCapacity. This mismatch can cause effective scaling actions to be incorrectly classified as ineffective. This pull request is to compute the expected processing rate after the scale method finalizes and adjusts newParallelism based on constraints such as max parallelism, vertex-level min/max limits, input partition counts, etc. ## Brief change log - *Compute the expected processing rate after the scale method finalizes and adjusts newParallelism based on constraints such as max parallelism, vertex-level min/max limits, input partition counts, etc.* ## Verifying this change <!-- Please make sure both new and modified tests in this PR follows the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing --> *(Please pick either of the following options)* This change is a trivial rework / code cleanup without any test coverage. *(or)* This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end deployment with large payloads (100MB)* - *Extended integration test for recovery after master (JobManager) failure* - *Manually verified the change by running a 4 node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.* ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): no - The public API, i.e., is any changes to the `CustomResourceDescriptors`: no - Core observer or reconciler logic that is regularly executed: no ## Documentation - Does this pull request introduce a new feature? no - If yes, how is the feature documented? not applicable -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
