[
https://issues.apache.org/jira/browse/FLINK-35814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867156#comment-17867156
]
Rui Fan commented on FLINK-35814:
---------------------------------
Thanks for the feedback!
I know that in some scenarios, infinitely increasing parallelism can alleviate
backlog. For example, the job has data skew: there are two hot keys, and these
two keys fall into adjacent keyGroups, so they are assigned to the same subtask.
When parallelism is increased far enough, these two keys may be assigned to two
different subtasks, as the sketch below illustrates.
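To make this concrete, here is a minimal sketch using Flink's
KeyGroupRangeAssignment. The keys and parallelism values are hypothetical; any
two keys whose key groups land next to each other behave this way:
{code:java}
import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public class HotKeyAssignmentDemo {
    public static void main(String[] args) {
        // Assumed maxParallelism; 128 is Flink's default for small jobs.
        int maxParallelism = 128;
        // Hypothetical hot keys; their actual key groups depend on their hash.
        String keyA = "hotKeyA";
        String keyB = "hotKeyB";

        int groupA = KeyGroupRangeAssignment.assignToKeyGroup(keyA, maxParallelism);
        int groupB = KeyGroupRangeAssignment.assignToKeyGroup(keyB, maxParallelism);

        // Each subtask owns a contiguous range of roughly maxParallelism / parallelism
        // key groups, so adjacent key groups stay on the same subtask until the
        // parallelism is high enough to split their range.
        for (int parallelism : new int[] {4, 16, 64, 128}) {
            int subtaskA = KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup(
                    maxParallelism, parallelism, groupA);
            int subtaskB = KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup(
                    maxParallelism, parallelism, groupB);
            System.out.printf(
                    "parallelism=%d: keyGroup %d -> subtask %d, keyGroup %d -> subtask %d%n",
                    parallelism, groupA, subtaskA, groupB, subtaskB);
        }
    }
}
{code}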
But I think the reasonable solution for this scenario is to fix the data skew
rather than to increase parallelism without bound, because even after
increasing parallelism, a single hot key still cannot be distributed across
multiple subtasks and remains the bottleneck of the job.
{quote}I experienced false positives which led to pipelines building huge
backlog. I think the feature needs to be made more robust before being enabled
by default.
{quote}
Hey [~mxm], would you mind elaborating on your pipeline? We can think about
how to enhance robustness together.
Based on the default parameters, the autoscaler blocks subsequent scale-ups
only when the actual throughput increase is less than 10% of the expected
throughput increase after scaling up. For example, if scaling from parallelism
2 to 4 is expected to raise throughput from 1000 to 2000 records/s (expected
increase: 1000 records/s), but the observed rate only reaches 1050 (actual
increase: 50, below 10% of 1000), the scale-up counts as ineffective. Would it
work if we checked this multiple times, with 2 as the default value?
It means we still allow a scale-up when the actual throughput increase falls
below 10% of the expected throughput increase once. But if it happens twice in
a row after scaling up, we block subsequent scale-ups.
If so, we can deprecate the
job.autoscaler.scaling.effectiveness.detection.enabled option and introduce a
new option, job.autoscaler.scaling.effectiveness.detection.number, with a
default value of 2.
* When scaling.effectiveness.detection.number is set to 1, the autoscaler
will block subsequent scale-ups as soon as one IneffectiveScaling happens.
* When scaling.effectiveness.detection.number is set to n, the autoscaler
will block subsequent scale-ups when n consecutive scale-ups are all
IneffectiveScaling (a sketch of this counting logic follows the list).
* Of course, when users don't want to use the `scaling.effectiveness.detection`
feature, they can set scaling.effectiveness.detection.number to
Integer.MAX_VALUE.
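A minimal sketch of the proposed counting semantics, assuming the existing 10%
threshold (job.autoscaler.scaling.effectiveness.threshold); the class and
method names below are illustrative, not the actual autoscaler code:
{code:java}
/** Illustrative sketch only; names do not match real autoscaler classes. */
public class IneffectiveScaleUpDetector {

    private final double effectivenessThreshold; // e.g. 0.1, i.e. 10%
    private final int detectionNumber;           // proposed default: 2

    private int consecutiveIneffectiveScaleUps = 0;

    public IneffectiveScaleUpDetector(double effectivenessThreshold, int detectionNumber) {
        this.effectivenessThreshold = effectivenessThreshold;
        this.detectionNumber = detectionNumber;
    }

    /** Record the outcome of one scale-up once the new throughput is observed. */
    public void onScaleUpEvaluated(double expectedIncrease, double actualIncrease) {
        if (actualIncrease < effectivenessThreshold * expectedIncrease) {
            consecutiveIneffectiveScaleUps++; // IneffectiveScaling happened
        } else {
            consecutiveIneffectiveScaleUps = 0; // one effective scale-up resets the count
        }
    }

    /** Block further scale-ups after detectionNumber consecutive ineffective ones. */
    public boolean shouldBlockScaleUp() {
        return consecutiveIneffectiveScaleUps >= detectionNumber;
    }
}
{code}
With detectionNumber = 2, a single ineffective scale-up is tolerated, a second
consecutive one blocks further scale-ups, and Integer.MAX_VALUE effectively
disables the blocking.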
WDYT?
> Don't scale up continuously when the throughput cannot be increased after
> scaling up
> ------------------------------------------------------------------------------------
>
> Key: FLINK-35814
> URL: https://issues.apache.org/jira/browse/FLINK-35814
> Project: Flink
> Issue Type: Improvement
> Components: Autoscaler
> Reporter: Rui Fan
> Assignee: Rui Fan
> Priority: Major
>
> h2. Motivation
> Currently, the parallelism will be increased continuously in some cases,
> such as data skew or a bottleneck in another system.
> In these cases, the throughput (processing rate) cannot be increased even if
> we increase the parallelism.
> h2. Solution
> We don't need to scale up the task continuously when the throughput cannot
> be increased after scaling up.
> And it's better to trigger some events to remind users to fix the issue
> manually.