[
https://issues.apache.org/jira/browse/FLINK-35814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867156#comment-17867156
]
Rui Fan commented on FLINK-35814:
---------------------------------
Thanks for the feedback!
I know that in some scenarios, infinitely increasing parallelism can alleviate
backlog. For example, the job has data skew: there are two hot keys, and these
two keys fall into adjacent keyGroups, so they are assigned to the same subtask.
When parallelism is increased far enough, these two keys may be assigned to two
different subtasks, as the sketch below illustrates.
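To make this concrete, here is a minimal sketch using Flink's
KeyGroupRangeAssignment. The keys and parallelism values are hypothetical; any
two keys whose key groups land next to each other behave this way:
{code:java}
import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public class HotKeyAssignmentDemo {
    public static void main(String[] args) {
        // Assumed maxParallelism; 128 is Flink's default for small jobs.
        int maxParallelism = 128;
        // Hypothetical hot keys; their actual key groups depend on their hash.
        String keyA = "hotKeyA";
        String keyB = "hotKeyB";

        int groupA = KeyGroupRangeAssignment.assignToKeyGroup(keyA, maxParallelism);
        int groupB = KeyGroupRangeAssignment.assignToKeyGroup(keyB, maxParallelism);

        // Each subtask owns a contiguous range of roughly maxParallelism / parallelism
        // key groups, so adjacent key groups stay on the same subtask until the
        // parallelism is high enough to split their range.
        for (int parallelism : new int[] {4, 16, 64, 128}) {
            int subtaskA = KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup(
                    maxParallelism, parallelism, groupA);
            int subtaskB = KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup(
                    maxParallelism, parallelism, groupB);
            System.out.printf(
                    "parallelism=%d: keyGroup %d -> subtask %d, keyGroup %d -> subtask %d%n",
                    parallelism, groupA, subtaskA, groupB, subtaskB);
        }
    }
}
{code}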
But I think the reasonable solution for this scenario is to fix the data skew
rather than to increase parallelism without bound, because even after
increasing parallelism, a single hot key still cannot be distributed across
multiple subtasks and remains the bottleneck of the job.
{quote}I experienced false positives which led to pipelines building huge
backlog. I think the feature needs to be made more robust before being enabled
by default.
{quote}
Hey [~mxm], would you mind elaborating on your pipeline? We can think about
how to enhance robustness together.
Based on the default parameters, the autoscaler blocks subsequent scale-ups
only when the actual throughput increase is less than 10% of the expected
throughput increase after scaling up. For example, if scaling from parallelism
2 to 4 is expected to raise throughput from 1000 to 2000 records/s (expected
increase: 1000 records/s), but the observed rate only reaches 1050 (actual
increase: 50, below 10% of 1000), the scale-up counts as ineffective. Would it
work if we checked this multiple times, with 2 as the default value?
It means we still allow a scale-up when the actual throughput increase falls
below 10% of the expected throughput increase once. But if it happens twice in
a row after scaling up, we block subsequent scale-ups.
If so, we can deprecate the
job.autoscaler.scaling.effectiveness.detection.enabled option and introduce a
new option, job.autoscaler.scaling.effectiveness.detection.number, with a
default value of 2.
* When scaling.effectiveness.detection.number is set to 1, the autoscaler
will block subsequent scale-ups as soon as one IneffectiveScaling happens.
* When scaling.effectiveness.detection.number is set to n, the autoscaler
will block subsequent scale-ups when n consecutive scale-ups are all
IneffectiveScaling (a sketch of this counting logic follows the list).
* Of course, when users don't want to use the `scaling.effectiveness.detection`
feature, they can set scaling.effectiveness.detection.number to
Integer.MAX_VALUE.
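A minimal sketch of the proposed counting semantics, assuming the existing 10%
threshold (job.autoscaler.scaling.effectiveness.threshold); the class and
method names below are illustrative, not the actual autoscaler code:
{code:java}
/** Illustrative sketch only; names do not match real autoscaler classes. */
public class IneffectiveScaleUpDetector {

    private final double effectivenessThreshold; // e.g. 0.1, i.e. 10%
    private final int detectionNumber;           // proposed default: 2

    private int consecutiveIneffectiveScaleUps = 0;

    public IneffectiveScaleUpDetector(double effectivenessThreshold, int detectionNumber) {
        this.effectivenessThreshold = effectivenessThreshold;
        this.detectionNumber = detectionNumber;
    }

    /** Record the outcome of one scale-up once the new throughput is observed. */
    public void onScaleUpEvaluated(double expectedIncrease, double actualIncrease) {
        if (actualIncrease < effectivenessThreshold * expectedIncrease) {
            consecutiveIneffectiveScaleUps++; // IneffectiveScaling happened
        } else {
            consecutiveIneffectiveScaleUps = 0; // one effective scale-up resets the count
        }
    }

    /** Block further scale-ups after detectionNumber consecutive ineffective ones. */
    public boolean shouldBlockScaleUp() {
        return consecutiveIneffectiveScaleUps >= detectionNumber;
    }
}
{code}
With detectionNumber = 2, a single ineffective scale-up is tolerated, a second
consecutive one blocks further scale-ups, and Integer.MAX_VALUE effectively
disables the blocking.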
WDYT?
> Don't scale up continuously when the throughput cannot be increased after
> scaling up
> ------------------------------------------------------------------------------------
>
> Key: FLINK-35814
> URL: https://issues.apache.org/jira/browse/FLINK-35814
> Project: Flink
> Issue Type: Improvement
> Components: Autoscaler
> Reporter: Rui Fan
> Assignee: Rui Fan
> Priority: Major
>
> h2. Motivation
> Currently, the parallelism will be increased continuously in some cases,
> such as data skew or a bottleneck in another system.
> In these cases, the throughput (processing rate) cannot be increased even if
> we increase the parallelism.
> h2. Solution
> We don't need to scale up the task continuously when the throughput cannot
> be increased after scaling up.
> And it's better to trigger some events to remind users to fix the issue
> manually.