[GitHub] [flink-kubernetes-operator] mxm commented on pull request #597: [FLINK-32100] Use the total number of Kafka partitions as the max source parallelism

via GitHub Tue, 16 May 2023 09:48:15 -0700


mxm commented on PR #597:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/597#issuecomment-1550024389


   We don't know how many divisors the number of partitions of a topic has. 
Similarly, we don't know how how many divisors the sum of the number of 
partitions of all topics has. What you say applies to any operator, not only to 
source operators. Source operators are limited by the max partitions which are 
usually smaller than the configured max vertex which will always be the max 
also for all vertices including sources.
   
   From my experience the worst thing that can happen is that the source is 
underprovisioned which this PR tries to avoid. Underprovisioned sources suffer 
from watermark alignment issues and will build huge lag. We need to allow 
sources to reach the max scale up which is the sum of all partitions.
   
   Let's say you have 73 (prime number) partitions because you read from topic1 
with 60 partitions and topic2 with 13 partitions. The autoscaler can now decide 
to scale between 1 and 73. Any of these parallelisms will work. In case of a 
non-prime sum of partitions, the scale up will at worst case be 
`Math.min(sum_of_partitions / 2, max_parallelism)` in case of key-group 
alignment. Otherwise, we will choose the target parallelism like in the prime 
example.
   
   I think we can think about adding a config option to disable keygroup 
alignment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [flink-kubernetes-operator] mxm commented on pull request #597: [FLINK-32100] Use the total number of Kafka partitions as the max source parallelism

Reply via email to