If the adaptive scheduler supported all execution modes (Native Application, Session, etc.), including active resource management, then I think we could use it all the time. I would love to use one scheduler instead of having two options.
Currently, however, there is a huge gap in functionality between active and passive resource management, and in my experience the active (native) integration is much more convenient for Kubernetes environments.

Gyula

On Thu, Jan 26, 2023 at 3:13 PM Konstantin Knauf <kna...@apache.org> wrote:

> Hi Gyula,
>
> if the adaptive scheduler supported active resource managers, would there
> be any other blocker to migrating to it? I don't know much about the
> implementation side here, but conceptually, once we have session mode
> support and each job in a session cluster declares its desired
> parallelism (!= infinity), there shouldn't be a big gap to supporting
> active resource managers. Am I missing something, Chesnay?
>
> Regarding the complexity, I was referring to the procedure that Max
> outlines in his ticket around checking whether slots are available and
> then triggering scaling operations. The adaptive scheduler already does
> this and, in my understanding, is more responsive in that regard than an
> external process would be.
>
> Cheers,
>
> Konstantin
>
> On Thu, Jan 26, 2023 at 3:05 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Hi Konstantin!
>>
>> I think the adaptive scheduler still will not support the Kubernetes
>> native integration and can only be used in standalone mode. This means
>> that the operator needs to manage all resources externally and compute
>> exactly how many new slots are needed during rescaling, etc.
>>
>> I think whatever scaling API we build should work for both standalone
>> and native integration as much as possible. It's not a duplicated effort
>> to add it to the standard scheduler as long as the adaptive scheduler
>> does not support active resource management.
>>
>> Also, it seems this will not reduce complexity on the operator side,
>> which can already do scaling actions by executing an upgrade.
>>
>> And a side note: the operator supports both the native and standalone
>> integration (and thereby both the standard and adaptive scheduler), but
>> the bigger problem is actually computing the required number of slots
>> and required new resources, which is much harder than simply using
>> active resource management.
>>
>> Cheers,
>> Gyula
>>
>> On Thu, Jan 26, 2023 at 2:57 PM Konstantin Knauf <kna...@apache.org>
>> wrote:
>>
>>> Hi Max,
>>>
>>> it seems to me we are now running into some of the potential
>>> duplication of efforts across the standard and adaptive scheduler that
>>> Chesnay had mentioned on the original ticket. The issue of having to do
>>> a full restart of the job for rescaling, as well as waiting for
>>> resources to be available before doing a rescaling operation, were some
>>> of the main motivations behind introducing the adaptive scheduler. In
>>> the adaptive scheduler we can further do things like only triggering a
>>> rescaling operation exactly when a checkpoint has completed, to
>>> minimize reprocessing. For jobs with small state size, the downtime
>>> during rescaling can already be << 1 second today.
>>>
>>> Chesnay and David Moravek are currently in the process of drafting two
>>> FLIPs that will extend the support of the adaptive scheduler to session
>>> mode and will allow clients to change the desired/min/max parallelism
>>> of the vertices of a job during its runtime via the REST API. We
>>> currently plan to publish a draft of these FLIPs next week for
>>> discussion. Would you consider moving to the adaptive scheduler for the
>>> Kubernetes operator, provided these FLIPs make it? I think it has the
>>> potential to simplify the logic required for rescaling on the operator
>>> side quite a bit.
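>>>
>>> Just to give a feeling for the direction (the FLIPs are not published
>>> yet, so the endpoint and the payload below are purely hypothetical,
>>> not the actual API), a client-driven rescale via the REST API could
>>> look roughly like this:
>>>
>>> import java.net.URI;
>>> import java.net.http.HttpClient;
>>> import java.net.http.HttpRequest;
>>> import java.net.http.HttpResponse;
>>>
>>> public class RescaleViaRestSketch {
>>>     public static void main(String[] args) throws Exception {
>>>         String jobId = args[0]; // the running job to rescale
>>>         // Hypothetical JSON shape: per-vertex min/desired/max parallelism.
>>>         // The real path and payload will be defined by the upcoming FLIPs.
>>>         String body = "{\"vertex-1\": {\"min\": 1, \"desired\": 4, \"max\": 8}}";
>>>         HttpRequest request = HttpRequest.newBuilder()
>>>                 .uri(URI.create("http://jobmanager:8081/jobs/" + jobId + "/parallelism"))
>>>                 .header("Content-Type", "application/json")
>>>                 .PUT(HttpRequest.BodyPublishers.ofString(body))
>>>                 .build();
>>>         HttpResponse<String> response = HttpClient.newHttpClient()
>>>                 .send(request, HttpResponse.BodyHandlers.ofString());
>>>         // The adaptive scheduler would pick up the new requirements and
>>>         // rescale in place, without a client-side stop/resubmit cycle.
>>>         System.out.println("HTTP " + response.statusCode());
>>>     }
>>> }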
>>>
>>> Best,
>>>
>>> Konstantin
>>>
>>> On Thu, Jan 26, 2023 at 12:16 PM Maximilian Michels <m...@apache.org>
>>> wrote:
>>>
>>>> Hey ConradJam,
>>>>
>>>> Thank you for your thoughtful response. It would be great to start
>>>> writing a FLIP for the Rescale API. If you want to take a stab at it,
>>>> please go ahead, I'd be happy to review. I'm sure Gyula or others will
>>>> also chime in.
>>>>
>>>> I want to answer your questions so we are aligned:
>>>>
>>>> > ● Does scaling work on YARN, or just K8s?
>>>>
>>>> I think it should work for both YARN and K8s. We would have to make
>>>> changes to the drivers (AbstractResourceManagerDriver), which is
>>>> implemented for both K8s and YARN. The outlined approach for rescaling
>>>> does not require integrating with those systems, just maybe updating
>>>> how the driver is used, so we should be able to make it work across
>>>> both YARN and K8s.
>>>>
>>>> > ● Does rescaling support standalone mode?
>>>>
>>>> Yes, I think it should and easily can. We do use a different type of
>>>> resource manager (StandaloneResourceManager, not ActiveResourceManager),
>>>> but I think the logic will sit on a higher level where the
>>>> ResourceManager implementation is not relevant.
>>>>
>>>> > ● Can we simplify the recovery steps?
>>>>
>>>> For the first version, I would prefer the simple approach of (1)
>>>> acquiring the required slots for rescaling, then (2) triggering a
>>>> stop-with-savepoint, and (3) resubmitting the job with updated
>>>> parallelisms. What you have in mind is a bit more involved but
>>>> certainly a great optimization, especially when only a fraction of the
>>>> job state needs to be repartitioned.
>>>>
>>>> > Of course, there are many details, such as
>>>> > ● At some point we may not be able to use this kind of hot update,
>>>> > and still need to restart the job; when this happens, we should
>>>> > prevent users from using rescaling requests
>>>>
>>>> I'm curious to learn more about "hot updates". How would we support
>>>> this in Flink? Would we have to support dynamically repartitioning
>>>> tasks? I don't think Flink supports this yet. For now, restarting the
>>>> job may be the best we can do.
>>>>
>>>> > ● After rescaling is submitted, when we fail, there should be a
>>>> > rollback mechanism to roll back to the previous degree of
>>>> > parallelism.
>>>>
>>>> This should not be necessary if all the requirements for rescaling,
>>>> e.g. enough task slots, are satisfied by the Rescale API. I'm not even
>>>> sure rolling back is an option, because we can't guarantee that a
>>>> rollback would always work.
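>>>>
>>>> To make the first version concrete, here is a rough sketch of the
>>>> three steps above (the Cluster interface and all names in it are
>>>> hypothetical, nothing here is an existing Flink API):
>>>>
>>>> import java.util.Map;
>>>>
>>>> public final class RescaleFlowSketch {
>>>>
>>>>     /** Hypothetical cluster facade, for illustration only. */
>>>>     interface Cluster {
>>>>         int freeSlots();
>>>>         String stopWithSavepoint(String jobId); // returns the savepoint path
>>>>         void submit(String jobId, Map<String, Integer> perVertexParallelism,
>>>>                 String savepointPath);
>>>>     }
>>>>
>>>>     static void rescale(Cluster cluster, String jobId,
>>>>             Map<String, Integer> newParallelisms, int slotsNeeded) {
>>>>         // (1) Check that the required slots are available *before*
>>>>         //     touching the running job, so we cannot get stuck later.
>>>>         if (cluster.freeSlots() < slotsNeeded) {
>>>>             throw new IllegalStateException(
>>>>                     "Not enough slots, rejecting rescale request");
>>>>         }
>>>>         // (2) Stop the job with a savepoint to capture consistent state.
>>>>         String savepointPath = cluster.stopWithSavepoint(jobId);
>>>>         // (3) Resubmit with the updated per-vertex parallelisms,
>>>>         //     restoring from the savepoint taken in step (2).
>>>>         cluster.submit(jobId, newParallelisms, savepointPath);
>>>>     }
>>>> }
>>>>
>>>> The important property is that step (1) fails fast before the job is
>>>> touched, so a rejected rescale request leaves the job running as-is,
>>>> which is also why no rollback mechanism should be needed.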
>>>>
>>>> Thanks,
>>>> Max
>>>>
>>>> On Tue, Jan 24, 2023 at 6:34 AM ConradJam <jam.gz...@gmail.com> wrote:
>>>>
>>>>> Hello Max,
>>>>>
>>>>> Thanks for driving this. I think there is no problem with your
>>>>> previous suggestion in FLINK-30773 [1]. Here I just want to add some
>>>>> supplementary points and questions, and share some suggestions and
>>>>> insights.
>>>>>
>>>>> I have been using the autoscaling of the Flink K8s Operator for some
>>>>> time. The current method is to stop the job and modify the
>>>>> parallelism, which interrupts the business for a long time. I think
>>>>> the purpose of modifying the Rescale API is to better fit cloud-native
>>>>> environments and reduce the impact of job scaling downtime.
>>>>>
>>>>> I have tried scaling with less downtime, and I call this step "hot
>>>>> updating parallelism" (if there are available slots, there is no need
>>>>> to redeploy the JobManager or TaskManagers on K8s).
>>>>>
>>>>> Around this topic, I raise the following questions:
>>>>> ● Does scaling work on YARN, or just K8s?
>>>>>   ○ I think we can support running on K8s for the first version, and
>>>>>     YARN can be considered later.
>>>>> ● Does rescaling support standalone mode?
>>>>>   ○ I think it can be supported. The essence is just to modify the
>>>>>     parallelism of job vertices. As for the tuning strategy, it should
>>>>>     be determined by the external system or the K8s Operator.
>>>>> ● Can we simplify the recovery steps?
>>>>>   ○ As far as I know, the traditional way to adjust the parallelism is
>>>>>     to stop a job with a savepoint and then run the job with the
>>>>>     adjusted parallelism. If we hide this step inside the JobManager,
>>>>>     it will be an important means to reduce the delay.
>>>>>
>>>>> Of course, there are many details, such as:
>>>>> ● At some point we may not be able to use this kind of hot update and
>>>>>   still need to restart the job; when this happens, we should prevent
>>>>>   users from using rescaling requests.
>>>>> ● After rescaling is submitted, when we fail, there should be a
>>>>>   rollback mechanism to roll back to the previous degree of
>>>>>   parallelism.
>>>>>
>>>>> ...and more ~
>>>>>
>>>>> By the way, since there is a lot of content, I did not expand on all
>>>>> the ideas and descriptions here. This proposal modifies the original
>>>>> Rescale API. I would also like to hear whether @gyula has some new
>>>>> ideas on this, as he was also involved in the development of FLIP-271.
>>>>> I am willing to write a FLIP for this, refine the ideas with the dev
>>>>> community, and then submit it. What do you think about starting a
>>>>> discussion in the community?
>>>>>
>>>>> 1. https://issues.apache.org/jira/browse/FLINK-30773
>>>>>
>>>>> Best~
>>>>>
>>>>> On Tue, Jan 24, 2023 at 1:08 AM Maximilian Michels <m...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The current rescale API appears to be a work in progress. A couple of
>>>>>> years ago, we disabled access to the API [1].
>>>>>>
>>>>>> I'm looking into this problem as part of working on autoscaling [2],
>>>>>> where we currently require a full restart of the job to apply the
>>>>>> parallelism overrides. This adds additional delay and comes with the
>>>>>> caveat that we don't know whether sufficient resources are available
>>>>>> prior to executing the scaling decision. We obviously do not want to
>>>>>> get stuck due to a lack of resources. So a rescale API would have to
>>>>>> ensure enough resources are available prior to restarting the job.
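>>>>>>
>>>>>> For reference, this is roughly what the restart-based flow looks like
>>>>>> on the operator side today (a sketch only: the option key is the
>>>>>> parallelism-overrides option from the autoscaler work, and the vertex
>>>>>> IDs are made-up examples):
>>>>>>
>>>>>> import org.apache.flink.configuration.Configuration;
>>>>>>
>>>>>> public class ParallelismOverridesSketch {
>>>>>>     public static void main(String[] args) {
>>>>>>         Configuration conf = new Configuration();
>>>>>>         // Per-vertex overrides as "jobVertexId:parallelism" pairs.
>>>>>>         // They only take effect when the job is fully restarted with
>>>>>>         // this configuration -- exactly the delay and resource
>>>>>>         // uncertainty a Rescale API would avoid.
>>>>>>         conf.setString("pipeline.jobvertex-parallelism-overrides",
>>>>>>                 "a1b2c3d4e5f60718293a4b5c6d7e8f90:4,"
>>>>>>                         + "0f9e8d7c6b5a49382716a5b4c3d2e1f0:2");
>>>>>>         // ... then stop the job with a savepoint and resubmit it
>>>>>>         // with `conf`.
>>>>>>     }
>>>>>> }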
>>>>>>
>>>>>> I've created an issue here:
>>>>>> https://issues.apache.org/jira/browse/FLINK-30773
>>>>>>
>>>>>> Any comments or interest in working on this?
>>>>>>
>>>>>> -Max
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-12312
>>>>>> [2]
>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
>>>>>
>>>>> --
>>>>> Best
>>>>>
>>>>> ConradJam
>>>
>>> --
>>> https://twitter.com/snntrable
>>> https://github.com/knaufk
>
> --
> https://twitter.com/snntrable
> https://github.com/knaufk