Re: Reworking the Rescale API

ConradJam Mon, 23 Jan 2023 21:34:06 -0800

Hello Max,

Thanks for driving this. I see no problem with your earlier suggestion in
FLINK-30773 [1]. Here I just want to add a few supplementary points and
open questions.


I have used the autoscaling of the Flink K8S Operator for some time. The
current method is to stop the job and modify the parallelism, which
interrupts the business for a long time. I think the purpose of reworking
the Rescale API is to better fit cloud-native deployments and to reduce the
downtime caused by job scaling.

I have tried scaling with less downtime, and I call this step "hot update
parallelism" (if enough slots are available, there is no need to redeploy
the JobManager or the TaskManagers on K8S).

On this topic, I would raise the *following questions*:
● Does scaling work on YARN, or just on K8S?
   ○ I think we can support K8S in the first version, and YARN can be
considered later.
● Does rescaling support Standalone mode?
   ○ I think it can be supported. In essence we only modify the
parallelism of the job vertices. The tuning strategy should be decided by
an external system or by the K8S Operator.
● Can we simplify the recovery steps?
   ○ As far as I know, the traditional way to adjust the parallelism is to
stop the job with a savepoint and then run the job again with the adjusted
parallelism. If we hide this step inside the *JobManager*, it will be an
important way to reduce the delay.
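
To make the "hot update" idea a bit more concrete, here is a rough sketch
of the decision I have in mind. All names (RescalePlanner, plan, Action)
are made up for illustration and are not Flink APIs. It also assumes that
all vertices share slots, so the job needs as many slots as its highest
vertex parallelism.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names are part of Flink's API.
final class RescalePlanner {

    enum Action { NO_CHANGE, HOT_UPDATE, SAVEPOINT_AND_RESTART }

    // Assumption: all vertices share slots, so the job needs as many
    // slots as its highest vertex parallelism.
    static int requiredSlots(Map<String, Integer> parallelism) {
        return parallelism.values().stream()
                .mapToInt(Integer::intValue)
                .max()
                .orElse(0);
    }

    // current:   parallelism per job vertex id as it runs today
    // overrides: requested parallelism for some (or all) vertices
    // freeSlots: slots that are free right now, before any redeployment
    static Action plan(Map<String, Integer> current,
                       Map<String, Integer> overrides,
                       int freeSlots) {
        Map<String, Integer> target = new HashMap<>(current);
        target.putAll(overrides);
        if (target.equals(current)) {
            return Action.NO_CHANGE;
        }
        // Scaling down never needs extra slots, hence the lower bound of 0.
        int extraSlots =
                Math.max(0, requiredSlots(target) - requiredSlots(current));
        return extraSlots <= freeSlots
                ? Action.HOT_UPDATE
                : Action.SAVEPOINT_AND_RESTART;
    }
}
```

For example, with current = {source: 2, map: 4}, overrides = {map: 8} and 4
free slots, plan(...) picks HOT_UPDATE. With only 3 free slots it falls
back to SAVEPOINT_AND_RESTART.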

  Of course, there are many details, such as:
● At some point we may not be able to use this kind of hot update and will
still need to restart the job. When this happens, we should reject
rescaling requests from users.
● If a rescaling request fails after it has been submitted, there should be
a rollback mechanism that restores the previous parallelism.
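
A minimal sketch of these two details, again with made-up names
(RescaleWithRollback, applyParallelism) that do not exist in Flink. The
applyParallelism callback stands in for whatever the JobManager would do to
switch the parallelism and is assumed to return true on success.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch: none of these names are part of Flink's API.
final class RescaleWithRollback {

    // Returns the parallelism that is in effect after the attempt.
    static int rescale(int previous,
                       int requested,
                       boolean hotUpdateSupported,
                       IntPredicate applyParallelism) {
        // Detail 1: refuse the request if a hot update is not possible.
        if (!hotUpdateSupported) {
            throw new IllegalStateException(
                    "Hot update not possible, a job restart is required");
        }
        if (applyParallelism.test(requested)) {
            return requested;
        }
        // Detail 2: the rescale failed, so go back to the previous value.
        if (!applyParallelism.test(previous)) {
            throw new IllegalStateException(
                    "Rollback to parallelism " + previous + " failed");
        }
        return previous;
    }
}
```

If the callback only accepts a parallelism of up to 4, a request to go from
4 to 8 ends with the job still running at 4.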

And there is more ～

  By the way, because there is a lot of content, I have not expanded on
more ideas and descriptions here. This proposal modifies the original
Rescaling API. I would also like to hear whether *@gyula* has some new
ideas on this, since Gyula was also involved in the development of FLIP-271.
I am willing to write a FLIP for this, refine the ideas together with the
dev community, and then submit it. What do you think about starting a
discussion in the community?


[1] https://issues.apache.org/jira/browse/FLINK-30773

Best～

Maximilian Michels <m...@apache.org> wrote on Tue, Jan 24, 2023 at 01:08:

> Hi,
>
> The current rescale API appears to be a work in progress. A couple years
> ago, we disabled access to the API [1].
>
> I'm looking into this problem as part of working on autoscaling [2] where
> we currently require a full restart of the job to apply the parallelism
> overrides. This adds additional delay and comes with the caveat that we
> don't know whether sufficient resources are available prior to executing
> the scaling decision. We obviously do not want to get stuck due to a lack
> of resources. So a rescale API would have to ensure enough resources are
> available prior to restarting the job.
>
> I've created an issue here:
> https://issues.apache.org/jira/browse/FLINK-30773
>
> Any comments or interest in working on this?
>
> -Max
>
> [1] https://issues.apache.org/jira/browse/FLINK-12312
> [2]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
>


-- 
Best

ConradJam


