[
https://issues.apache.org/jira/browse/FLINK-31976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718235#comment-17718235
]
Tan Kim commented on FLINK-31976:
---------------------------------
Thank you for clarifying the issue.
I think my earlier comment may have been misleading.
I think it would be good to discuss point 2 further.
In the chart above, we need to think about why the TRUE_PROCESSING_RATE_AVG
value remained constant even after PARALLELISM went from 1 to 2.
The next chart shows the metrics for the source operator over the same time
period.
!image-2023-05-01-23-55-08-254.png|width=703,height=396!
The PARALLELISM value went from 1 to 2 around 10:50, when the lag started to
build up a bit.
However, there was still no increase in throughput.
Since the throughput of the source operator didn't increase, it makes sense
that the throughput downstream (chart above) didn't change either.
I suspect that this issue might be related to the Kafka source connector.
During our testing, we sometimes noticed that even though the PARALLELISM of
the source operator increased, only some subtasks were busy and the rest were
idle.
This could be an issue with topic partitions not being evenly distributed
across the consumers (subtasks) when the TaskManagers restart as scaling
occurs; the small sketch at the end of this comment illustrates how the
spread can become uneven.
If my guess is correct, this could have been judged as ineffective scaling,
because increasing the parallelism of the source would not result in any
change in throughput, and the same goes for the downstream operators.
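To make the partition distribution idea more concrete, here is a minimal sketch of how partitions can end up unevenly spread over source subtasks. It assumes a hash-plus-offset owner mapping in the spirit of the Kafka source enumerator's split assignment; the topic name and partition count are made up for the example, and it does not model what happens to restored split state across the TM restart.
{code:java}
import java.util.Arrays;

// Rough sketch only: assumes a hash-plus-offset owner mapping similar in spirit
// to the Kafka source enumerator's split assignment. Topic name and partition
// count below are hypothetical, just to show how the spread can become uneven.
public class PartitionSpreadSketch {

    static int splitOwner(String topic, int partition, int numReaders) {
        int startIndex = ((topic.hashCode() * 31) & 0x7FFFFFFF) % numReaders;
        return (startIndex + partition) % numReaders;
    }

    public static void main(String[] args) {
        String topic = "events";   // hypothetical topic
        int partitions = 4;        // hypothetical partition count

        for (int parallelism : new int[] {1, 2, 3}) {
            int[] perSubtask = new int[parallelism];
            for (int p = 0; p < partitions; p++) {
                perSubtask[splitOwner(topic, p, parallelism)]++;
            }
            // e.g. at parallelism 3 this prints something like [2, 1, 1]:
            // one subtask owns two partitions while the others own one each.
            System.out.println("parallelism=" + parallelism
                    + " partitions per subtask=" + Arrays.toString(perSubtask));
        }
    }
}
{code}
This does not cover the TM-restart path I suspect, but it shows that even the static assignment is not always even.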
> Once marked as an inefficient scale-up, further scaling may not happen forever
> ------------------------------------------------------------------------------
>
> Key: FLINK-31976
> URL: https://issues.apache.org/jira/browse/FLINK-31976
> Project: Flink
> Issue Type: Improvement
> Components: Autoscaler
> Affects Versions: 1.17.0
> Reporter: Tan Kim
> Priority: Major
> Attachments: image-2023-05-01-22-41-57-208.png,
> image-2023-05-01-23-54-06-383.png, image-2023-05-01-23-55-08-254.png
>
>
> Whether a scale-up was ineffective is determined as follows:
> {code:java}
> double lastProcRate =
>         lastSummary.getMetrics().get(TRUE_PROCESSING_RATE).getAverage();
> double lastExpectedProcRate =
>         lastSummary.getMetrics().get(EXPECTED_PROCESSING_RATE).getCurrent();
> var currentProcRate = evaluatedMetrics.get(TRUE_PROCESSING_RATE).getAverage();
>
> double expectedIncrease = lastExpectedProcRate - lastProcRate;
> double actualIncrease = currentProcRate - lastProcRate;
>
> boolean withinEffectiveThreshold =
>         (actualIncrease / expectedIncrease)
>                 >= conf.get(AutoScalerOptions.SCALING_EFFECTIVENESS_THRESHOLD);
> {code}
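> The same condition can be restated as a small self-contained helper for
> experimenting with concrete numbers (a sketch only; the names and the
> standalone form are illustrative, while the real check operates on the
> evaluated metrics shown above):
> {code:java}
> // Sketch: the effectiveness check rewritten as a pure function.
> public class EffectivenessCheckSketch {
>
>     static boolean withinEffectiveThreshold(
>             double lastProcRate,          // TRUE_PROCESSING_RATE avg at the last scaling
>             double lastExpectedProcRate,  // EXPECTED_PROCESSING_RATE at the last scaling
>             double currentProcRate,       // TRUE_PROCESSING_RATE avg now
>             double threshold) {           // SCALING_EFFECTIVENESS_THRESHOLD, default 0.1
>         double expectedIncrease = lastExpectedProcRate - lastProcRate;
>         double actualIncrease = currentProcRate - lastProcRate;
>         return (actualIncrease / expectedIncrease) >= threshold;
>     }
>
>     public static void main(String[] args) {
>         // Hypothetical numbers: the last scale-up promised +1,000 rec/s but only
>         // +50 rec/s materialized, so the scale-up is flagged as ineffective.
>         System.out.println(withinEffectiveThreshold(2000, 3000, 2050, 0.1)); // false
>         // The flag only clears if the current rate later reaches at least 2,100 rec/s.
>         System.out.println(withinEffectiveThreshold(2000, 3000, 2150, 0.1)); // true
>     }
> }
> {code}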
> Because the expectedIncrease value references the last scaling history, it
> will not change unless there is an additional scale-up; only the
> actualIncrease value will change.
> The actualIncrease value is derived from currentProcRate (the average of
> TRUE_PROCESSING_RATE), which is calculated as follows:
> trueProcessingRate = busyTimeMultiplier * numRecordsInPerSecond.getSum()
> For example, suppose a scale-up has been marked as ineffective, but the lag
> continues to build up.
> The job needs to scale up to eliminate the growing lag, but because the
> previous scale-up was marked as ineffective, it won't happen.
> For a scale-up to no longer be considered ineffective, the following condition
> must be met:
> actualIncrease / expectedIncrease >= SCALING_EFFECTIVENESS_THRESHOLD (default 0.1)
> Here, expectedIncrease is fixed by lastSummary, so for the ratio to grow the
> value of actualIncrease must increase.
> However, actualIncrease is proportional to busyTimeMultiplier and
> numRecordsInPerSecond, and these two values will converge to steady values if
> no further scaling occurs.
> Therefore, actualIncrease will also converge.
> If it fails to cross the threshold, no further scale-up is possible, even if
> the lag continues to build up.
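> Plugging hypothetical numbers into the condition makes the problem concrete:
> say lastProcRate = 2,000 rec/s and lastExpectedProcRate = 3,000 rec/s, so
> expectedIncrease is fixed at 1,000 rec/s. If the true processing rate then
> converges to roughly 2,050 rec/s, actualIncrease converges to roughly 50 rec/s
> and the ratio to 0.05, which never reaches the default threshold of 0.1. From
> that point on, every further scale-up is rejected as ineffective, no matter
> how large the lag grows.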