Re: [PR] [FLINK-31215] [autoscaler] Backpropagate processing rate limits from non-scalable bottlenecks to upstream operators [flink-kubernetes-operator]

via GitHub Tue, 16 Jul 2024 03:19:43 -0700


aplyusnin commented on PR #847:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/847#issuecomment-2230539713


   Thank you for your reply.
   
   Now, the backpropagation logic for a single vertex is the following:
   
   1. Adjust the target data rate by the factor from downstream (processingRateCapacity and currentBackPropFactor)
   2. Update the backpropagation factor if the required parallelism (target data rate divided by true processing rate) exceeds the max parallelism of the vertex
   3. Evaluate the data rate coming from the direct upstream
   4. Backpropagate the factor to the direct upstream
   
   For example, take a look at **operator 3**.
   
   
![pr](https://github.com/user-attachments/assets/347b5d4c-02a1-4b2e-8d74-742e30aa768c)
   
   Initially, its target data rate was 250, and it is lowered by upstream by a factor of 0.8 and becomes 200.
   To process the whole data rate, the new parallelism should be 200 / 50 * 10 = 40 (target data rate / processing rate * parallelism).
   
   This value is 2 times larger than the max parallelism (40 / 20 = 2), so the vertex is a bottleneck and the backpressure factor to propagate is 0.8 (from upstream) * 20 / 40 = 0.4.
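
   The arithmetic for **operator 3** can be reproduced with a minimal Python sketch. The function name `bottleneck_factor` and its parameters are hypothetical and are not taken from the PR; the sketch also assumes that the processing rate of 50 is measured at the current parallelism of 10, as in the formula above.

```python
def bottleneck_factor(target_rate, incoming_factor, processing_rate,
                      parallelism, max_parallelism):
    """Return (adjusted_rate, required_parallelism, factor_to_propagate).

    Hypothetical helper for illustration only, not the operator's API.
    """
    # Step 1: lower the target data rate by the factor received so far.
    adjusted_rate = target_rate * incoming_factor
    # Parallelism needed to process the adjusted rate, assuming the
    # processing rate scales linearly with parallelism.
    required = adjusted_rate / processing_rate * parallelism
    factor = incoming_factor
    if required > max_parallelism:
        # Step 2: the vertex cannot scale far enough, so shrink the factor
        # by the share of the required parallelism that is achievable.
        factor *= max_parallelism / required
    return adjusted_rate, required, factor


# Values from the operator 3 example: target 250, factor 0.8,
# processing rate 50 at parallelism 10, max parallelism 20.
rate, required, factor = bottleneck_factor(250, 0.8, 50, 10, 20)
print(rate, required, factor)
```

   With these inputs the sketch follows the same steps as the text: 200 for the adjusted rate, 40 for the required parallelism, and 0.4 for the factor to propagate (up to floating-point rounding).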
   
   
![pr2](https://github.com/user-attachments/assets/0a85b80c-97f3-45f2-9483-bdd8d38de1aa)
   
   Now it's time to propagate the factor to the direct upstream (operator 1 and 
operator 2). Note that operator 1 is already adjusted by some other vertices.
   
   First, the data rate from the direct upstream is evaluated (target data rate * output rate * backpressure factor): 100 * 2 * 0.5 = 100 from **operator 1** and 50 * 1 * 1 = 50 from **operator 2**, summing to 150.
   
   Since the adjusted target data rate of **operator 3** is 100 (250 * 0.4) and the upstream provides 150, **all** direct upstream operators should be lowered. To do so, each of their backpressure factors should be multiplied by 100 / 150 = 2/3 (target data rate / data rate from the upstream).
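
   The upstream step can be sketched the same way. The name `upstream_scale` and the tuple layout are hypothetical; returning a scale of 1.0 when the upstream already provides no more than the target is an assumption of this sketch, not something stated in the PR.

```python
def upstream_scale(adjusted_target_rate, upstreams):
    """Return (incoming_rate, scale) for the direct upstream vertices.

    upstreams: list of (target_rate, output_rate, factor) tuples.
    Hypothetical helper for illustration only.
    """
    # Step 3: data rate each upstream vertex sends after its own factor.
    incoming = sum(rate * out * factor for rate, out, factor in upstreams)
    if incoming <= adjusted_target_rate:
        # Assumption: nothing to lower when the upstream does not overshoot.
        return incoming, 1.0
    # Step 4: common multiplier applied to every direct upstream factor.
    return incoming, adjusted_target_rate / incoming


# Operator 1: target 100, output rate 2, factor 0.5.
# Operator 2: target 50, output rate 1, factor 1.
upstreams = [(100, 2, 0.5), (50, 1, 1.0)]
incoming, scale = upstream_scale(100, upstreams)
new_factors = [factor * scale for _, _, factor in upstreams]
print(incoming, scale, new_factors)
```

   Here the incoming rate is 150 and the multiplier is 2/3, so the factors of operator 1 and operator 2 become roughly 1/3 and 2/3.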
   
   
![pr3](https://github.com/user-attachments/assets/c3675766-3852-4b8a-84cc-5095065a6e29)
   Hope this example helps.
   
   
   This process repeats for all vertices in reverse topological order. Then, 
the target data rate is updated using scale factors propagated to sources.
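
   As an illustration of the visiting order only, here is a small sketch using Python's standard `graphlib`. The graph and the vertex names (`op1`, `op2`, `op3`, `sink`) are hypothetical and are not the topology from the figures.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each vertex maps to its direct upstream vertices.
upstream = {
    "op1": set(),
    "op2": set(),
    "op3": {"op1", "op2"},
    "sink": {"op3"},
}

# static_order() yields upstream vertices before their downstream ones,
# so reversing it visits every vertex after all of its downstream vertices.
topological = list(TopologicalSorter(upstream).static_order())
reverse_order = topological[::-1]
print(reverse_order)
```

   Visiting vertices in this order means that each vertex has already received the factors from everything downstream of it before it passes its own factor on to its direct upstream.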
   
   There are also some extra checks to prevent aggressive scaling down.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.