[
https://issues.apache.org/jira/browse/FLINK-31976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718219#comment-17718219
]
Tan Kim commented on FLINK-31976:
---------------------------------
As you say, trimming the history in the GET operation would improve this.
However, it does not seem to be a fundamental fix.
Below are some scaling-related metrics for a case where the lag keeps
increasing but the scale-up is marked as ineffective, which prevents further
scaling.
!image-2023-05-01-22-41-57-208.png|width=655,height=327!
Under normal circumstances (roughly between 09:00 and 10:00), the value of
TRUE_PROCESSING_RATE_AVG lies between SCALE_DOWN_THRESHOLD and
SCALE_UP_THRESHOLD. However, at around 10:00, after the PARALLELISM value
increases from 1 to 2, the TRUE_PROCESSING_RATE_AVG value remains
significantly below SCALE_UP_THRESHOLD, so the scale-up is marked as
ineffective and scaling does not continue.
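To make the check concrete, here is a minimal sketch of the effectiveness
ratio applied to a single evaluation after the 1 -> 2 scale-up. The class
name, method names, and all rate values are hypothetical (they are not read
from the attached chart); only the formula and the 0.1 default come from the
issue description.

```java
/** Sketch of the effectiveness check from the issue, using made-up numbers. */
final class ScalingEffectivenessSketch {

    // Default of SCALING_EFFECTIVENESS_THRESHOLD as stated in the issue description.
    static final double EFFECTIVENESS_THRESHOLD = 0.1;

    /** Fraction of the expected processing-rate gain that was actually achieved. */
    static double effectivenessRatio(
            double lastProcRate, double lastExpectedProcRate, double currentProcRate) {
        double expectedIncrease = lastExpectedProcRate - lastProcRate;
        double actualIncrease = currentProcRate - lastProcRate;
        return actualIncrease / expectedIncrease;
    }

    /** Mirrors the withinEffectiveThreshold comparison from the issue. */
    static boolean withinEffectiveThreshold(
            double lastProcRate, double lastExpectedProcRate, double currentProcRate) {
        return effectivenessRatio(lastProcRate, lastExpectedProcRate, currentProcRate)
                >= EFFECTIVENESS_THRESHOLD;
    }

    public static void main(String[] args) {
        // Hypothetical values in records per second.
        double lastProcRate = 1000.0;         // average rate before the scale-up
        double lastExpectedProcRate = 2000.0; // rate expected after the scale-up
        double currentProcRate = 1050.0;      // average rate observed afterwards

        System.out.println("ratio = "
                + effectivenessRatio(lastProcRate, lastExpectedProcRate, currentProcRate));
        System.out.println("effective = "
                + withinEffectiveThreshold(lastProcRate, lastExpectedProcRate, currentProcRate));
    }
}
```

With these assumed numbers only 5% of the expected gain is achieved, which is
below the 10% threshold, so the scale-up would be treated as ineffective.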
> Once marked as an inefficient scale-up, further scaling may not happen forever
> ------------------------------------------------------------------------------
>
> Key: FLINK-31976
> URL: https://issues.apache.org/jira/browse/FLINK-31976
> Project: Flink
> Issue Type: Improvement
> Components: Autoscaler
> Affects Versions: 1.17.0
> Reporter: Tan Kim
> Priority: Major
> Attachments: image-2023-05-01-22-41-57-208.png
>
>
> Whether a scale-up is inefficient is determined as follows:
> {code:java}
> double lastProcRate =
>         lastSummary.getMetrics().get(TRUE_PROCESSING_RATE).getAverage();
> double lastExpectedProcRate =
>         lastSummary.getMetrics().get(EXPECTED_PROCESSING_RATE).getCurrent();
> var currentProcRate = evaluatedMetrics.get(TRUE_PROCESSING_RATE).getAverage();
>
> double expectedIncrease = lastExpectedProcRate - lastProcRate;
> double actualIncrease = currentProcRate - lastProcRate;
>
> boolean withinEffectiveThreshold =
>         (actualIncrease / expectedIncrease)
>                 >= conf.get(AutoScalerOptions.SCALING_EFFECTIVENESS_THRESHOLD);
> {code}
> Because expectedIncrease is derived from the last scaling history entry, it
> does not change unless another scale-up happens; only actualIncrease
> changes.
> actualIncrease depends on currentProcRate (the average of
> TRUE_PROCESSING_RATE), and TRUE_PROCESSING_RATE is calculated as follows:
> trueProcessingRate = busyTimeMultiplier * numRecordsInPerSecond.getSum()
> For example, suppose a scale-up has been marked as inefficient, but the lag
> continues to build up.
> Another scale-up is needed to eliminate the growing lag, but because the
> previous one is marked as inefficient, it does not happen.
> To unmark a scale-up as inefficient, the following condition must be met:
> actualIncrease / expectedIncrease >= SCALING_EFFECTIVENESS_THRESHOLD (default
> 0.1)
> Here, expectedIncrease is fixed by lastSummary, so the only way to meet the
> condition is for actualIncrease to grow.
> However, actualIncrease is driven by busyTimeMultiplier and
> numRecordsInPerSecond, and both of these converge to steady values if no
> scaling occurs.
> Therefore, actualIncrease also converges.
> If the converged value stays below the threshold, no further scale-up is
> possible, even if the lag continues to build up.
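The stuck state described in the issue can be illustrated by solving the
condition for the processing rate that would clear the mark and comparing it
with a converging series of observations. This is only a sketch: the class and
method names and every rate value are hypothetical, and it assumes
expectedIncrease is positive.

```java
/** Sketch of why a converged processing rate can leave the job stuck. */
final class StuckScaleUpSketch {

    // Default of SCALING_EFFECTIVENESS_THRESHOLD as stated in the issue description.
    static final double EFFECTIVENESS_THRESHOLD = 0.1;

    /**
     * Smallest currentProcRate for which
     * actualIncrease / expectedIncrease >= threshold,
     * assuming expectedIncrease (lastExpectedProcRate - lastProcRate) is positive.
     */
    static double requiredProcRate(double lastProcRate, double lastExpectedProcRate) {
        double expectedIncrease = lastExpectedProcRate - lastProcRate;
        return lastProcRate + EFFECTIVENESS_THRESHOLD * expectedIncrease;
    }

    /** Counts the evaluations whose observed rate would clear the ineffective mark. */
    static int evaluationsThatClearMark(
            double lastProcRate, double lastExpectedProcRate, double[] observedRates) {
        double required = requiredProcRate(lastProcRate, lastExpectedProcRate);
        int cleared = 0;
        for (double rate : observedRates) {
            if (rate >= required) {
                cleared++;
            }
        }
        return cleared;
    }

    public static void main(String[] args) {
        // Hypothetical values in records per second. lastProcRate and
        // lastExpectedProcRate come from the last scaling summary and stay fixed.
        double lastProcRate = 1000.0;
        double lastExpectedProcRate = 2000.0;

        // Hypothetical observations that converge because no scaling occurs.
        double[] observedRates = {1040.0, 1048.0, 1050.0, 1050.0, 1050.0};

        System.out.println("required rate = "
                + requiredProcRate(lastProcRate, lastExpectedProcRate));
        System.out.println("evaluations clearing the mark = "
                + evaluationsThatClearMark(lastProcRate, lastExpectedProcRate, observedRates));
    }
}
```

Because the required rate is fixed by the last summary while the observed rate
settles below it, no later evaluation clears the mark in this sketch, no
matter how long the job runs.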