Thanks for the FLIP.

Some comments:
1. Can you specify the full proposed configuration name?
"scaling-cooldown-period" is probably not the full config name. (See the
rough sketch below these comments for the kind of declaration I'd expect.)
2. Why are the concepts of scaling events and a scaling queue needed? If I
remember correctly, the adaptive scheduler just checks how many
TaskManagers are available and then adjusts the execution graph accordingly.
There's no need to store a number of scaling events; we just need to
determine when to trigger an adjustment of the execution graph.
3. What's the behavior w.r.t. JobManager failures (e.g. when we lose the
state of the Adaptive Scheduler)? My proposal would be to just reset the
cooldown period, so that after recovery of a JobManager we have to wait at
least the cooldown period before further scaling operations are done.
4. What's the relationship to the
"jobmanager.adaptive-scheduler.resource-stabilization-timeout"
configuration?
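
To make comments 1 and 4 a bit more concrete, here is roughly the kind of
option declaration I would expect. The full key name and the default are
only my guesses, not taken from the FLIP:

    import org.apache.flink.configuration.ConfigOption;
    import org.apache.flink.configuration.ConfigOptions;
    import java.time.Duration;

    // Sketch only -- the key name and default value are hypothetical,
    // not taken from the FLIP.
    public static final ConfigOption<Duration> SCALING_COOLDOWN_PERIOD =
        ConfigOptions.key("jobmanager.adaptive-scheduler.scaling-cooldown-period")
            .durationType()
            .defaultValue(Duration.ofSeconds(30))
            .withDescription(
                "Minimum delay between two scaling operations of the adaptive scheduler.");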

Thanks a lot for working on this!

Best,
Robert

On Wed, Jun 14, 2023 at 3:38 PM Etienne Chauchot <echauc...@apache.org>
wrote:

> Hi all,
>
> @Yuxia, I updated the FLIP to include the aggregation of the stacked
> operations that we discussed below. PTAL.
>
> Best
>
> Etienne
>
>
> Le 13/06/2023 à 16:31, Etienne Chauchot a écrit :
> > Hi Yuxia,
> >
> > Thanks for your feedback. The number of potentially stacked operations
> > depends on the configured length of the cooldown period.
> >
> >
> >
> > The proposal in the FLIP is to add a minimum delay between 2 scaling
> > operations. But, indeed, an optimization could be to still stack the
> > operations that arrive during a cooldown period and, rather than taking
> > only the last one, aggregate them so that we end up with a single
> > aggregated operation when the cooldown period ends. For example, if 3
> > TaskManagers come up and 1 goes down during the cooldown period, we
> > could generate a single scale-up operation of +2 when the period ends.
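> >
> > Just to illustrate (this is only a sketch; the names are hypothetical
> > and not part of the FLIP), the aggregation would essentially boil down
> > to summing the deltas of the stacked events:
> >
> >     import java.util.List;
> >
> >     // Hypothetical helper, not from the FLIP.
> >     final class ScalingEventAggregator {
> >         // Positive deltas are scale-ups, negative deltas are scale-downs,
> >         // e.g. [+1, +1, +1, -1] stacked during the cooldown period -> +2.
> >         static int aggregate(List<Integer> stackedDeltas) {
> >             return stackedDeltas.stream().mapToInt(Integer::intValue).sum();
> >         }
> >     }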
> >
> > As a side note regarding your comment on "it'll take a long time to
> > finish all", please keep in mind that the reactive mode (at least for
> > now) is only available for streaming pipelines, which are in essence
> > infinite processing.
> >
> > Another side note: when you mention "every taskmanager connecting",
> > if you are referring to the start of the pipeline, please keep in mind
> > that the adaptive scheduler has a "waiting for resources" timeout
> > period before starting the pipeline, during which all TaskManagers
> > connect and the parallelism is decided.
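> >
> > If I remember the key names correctly, that phase is governed by
> > options along these lines (the values here are purely illustrative):
> >
> >     import org.apache.flink.configuration.Configuration;
> >
> >     Configuration conf = new Configuration();
> >     // Time to wait for the requested resources before starting the job.
> >     conf.setString("jobmanager.adaptive-scheduler.resource-wait-timeout", "5 min");
> >     // Time the resources have to be stable before the parallelism is decided.
> >     conf.setString("jobmanager.adaptive-scheduler.resource-stabilization-timeout", "10 s");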
> >
> > Best
> >
> > Etienne
> >
> > Le 13/06/2023 à 03:58, yuxia a écrit :
> >> Hi, Etienne. Thanks for driving it. I have one question about the
> >> mechanism of the cooldown timeout.
> >>
> >> From the Proposed Changes part, if a scaling event is received and
> >> it falls during the cooldown period, it'll be stacked to be executed
> >> after the period ends. Also, from the description of FLINK-21883[1],
> >> the cooldown timeout is meant to avoid rescaling the job very
> >> frequently, because TaskManagers do not all connect at the same time.
> >>
> >> So, is it possible that every TaskManager connection produces a
> >> scaling event, so that many scale-up events get stacked and it takes
> >> a long time to finish them all? Can we just take the last event?
> >>
> >> [1]: https://issues.apache.org/jira/browse/FLINK-21883
> >>
> >> Best regards, Yuxia
> >>
> >> ----- Original Message ----- From: "Etienne Chauchot" <echauc...@apache.org>
> >> To: "dev" <dev@flink.apache.org>, "Robert Metzger" <metrob...@gmail.com>
> >> Sent: Monday, June 12, 2023, 11:34:25 PM
> >> Subject: [DISCUSS] FLIP-322 Cooldown period for adaptive scheduler
> >>
> >> Hi,
> >>
> >> I’d like to start a discussion about FLIP-322 [1] which introduces a
> >> cooldown period for the adaptive scheduler.
> >>
> >> I'd like to get your feedback especially @Robert as you opened the
> >> related ticket and worked on the reactive mode a lot.
> >>
> >> [1]
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-322+Cooldown+period+for+adaptive+scheduler
> >>
> >>
> >>
> >> Best
> >>
> >> Etienne
