Re: [DISCUSS] FLIP-322 Cooldown period for adaptive scheduler

Etienne Chauchot Thu, 29 Jun 2023 06:43:56 -0700

Thanks Chesnay for your feedback. I have updated the FLIP. I'll start a vote thread.


Best


Etienne

On 28/06/2023 at 11:49, Chesnay Schepler wrote:

> we should schedule a check that will rescale if min-parallelism-increase is met. Then, what is the use of the scaling-interval.max timeout in that context?
To force a rescale if min-parallelism-increase is not met (but we could still run above the current parallelism).
min-parallelism-increase is a trade-off between the cost of rescaling vs the performance benefit of the parallelism increase. Over time the balance tips more and more in favor of the parallelism increase, hence we should eventually rescale anyway even if the minimum isn't met, or at least give users the option to do so.
> I meant the opposite: not having only the cooldown but having only the stabilization time. I must have missed something because what I wonder is: if every rescale entails a restart of the pipeline and every restart entails passing through the waiting for resources state, then why introduce a cooldown when there is already at each rescale a stable resource timeout?
It is technically correct that the stable resource timeout can be used to limit the number of rescale operations per interval; however, during that time the job isn't running, in contrast to the cooldown.
Having both just gives you a lot more flexibility.
"I want at most 1 rescale operation per hour, and wait at most 1 minute for resources to stabilize when a rescale happens".
You can't express this with only one of the options.
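
The interplay of the three options discussed above can be sketched as follows. This is only an illustration under stated assumptions: the class, function and parameter names are hypothetical (not Flink's API), times are plain seconds, only scale-ups are modeled, and the resource stabilization timeout is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RescalePolicy:
    """Hypothetical container mirroring the options named in the thread."""
    min_parallelism_increase: int
    scaling_interval_min_s: float  # cooldown: never rescale sooner than this
    scaling_interval_max_s: float  # after this long, rescale on any gain


def should_rescale(policy: RescalePolicy,
                   current_parallelism: int,
                   available_parallelism: int,
                   seconds_since_last_rescale: float) -> bool:
    """Decide whether a scale-up should happen now (illustrative only)."""
    gain = available_parallelism - current_parallelism
    if gain <= 0:
        # Nothing to gain; scale-downs are not modeled in this sketch.
        return False
    if seconds_since_last_rescale < policy.scaling_interval_min_s:
        # Still inside the cooldown period.
        return False
    if gain >= policy.min_parallelism_increase:
        # The gain is judged worth the cost of a restart.
        return True
    # Gain is below the minimum: rescale only once the max interval elapsed.
    return seconds_since_last_rescale >= policy.scaling_interval_max_s


# "At most 1 rescale operation per hour": cooldown of 3600 s.
policy = RescalePolicy(min_parallelism_increase=2,
                       scaling_interval_min_s=3600,
                       scaling_interval_max_s=7200)
```

With this policy, a gain of 1 slot is ignored at 4000 s but accepted at 7200 s, which is the "balance tips over time" behavior described in the reply.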

On 20/06/2023 14:41, Etienne Chauchot wrote:
Hi Chesnay,

Thanks for your feedback. Comments inline

On 16/06/2023 at 17:24, Chesnay Schepler wrote:
1) Options specific to the adaptive scheduler should start with "jobmanager.adaptive-scheduler".
ok
2)
There isn't /really/ a notion of a "scaling event". The scheduler is informed about new/lost slots and job failures, and reacts accordingly by maybe rescaling the job. (Sure, you can think of these as events, but you can think of practically everything as events.)
There shouldn't be a queue for events. All the scheduler should have to know is that the next rescale check is scheduled for time T, which in practice boils down to a flag and a scheduled action that runs Executing#maybeRescale.
Makes total sense, it's very simple like this. Thanks for the clarification and the pointer. Having read the related FLIPs, I'll look at the code now.
With that in mind, we also have to look at how we keep this state around. Presumably it is scoped to the current state, such that the cooldown is reset if a job fails. Maybe we should add a separate ExecutingWithCooldown state; not sure yet.
Yes, losing the cooldown state and resetting the cooldown upon failure is what I suggested in point 3 in the previous email. Not sure either about a new state, I'll figure it out after experimenting with the code. I'll update the FLIP then.
It would be good to clarify whether this FLIP only attempts to cover scale up operations, or also scale downs in case of slot losses.
When slots are lost, most of the time it is due to a TM loss, so there should be several slots lost at the same time but (hopefully) only once. There should not be many scale downs in a row (but cascading failures can still happen). I think we should just protect against having scale ups immediately following. For that, I think we could just keep the current behavior of transitioning to the Restarting state and then back to the Waiting for Resources state. This state will protect us against scale ups immediately following failure/restart.
We should also think about how it relates to the externalized declarative resource management. Should we always rescale immediately? Should we wait until the cooldown is over?
It relates to point 2, no? We should rescale immediately only if the last rescale was done more than scaling-interval.min ago; otherwise schedule a rescale at last-rescale + scaling-interval.min time.
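
The scheduling rule in this reply can be sketched in a few lines. This is a minimal illustration, not Flink code: the function and parameter names are hypothetical and times are plain seconds on a common clock.

```python
def next_rescale_time(now_s: float,
                      last_rescale_s: float,
                      scaling_interval_min_s: float) -> float:
    """Return the time at which the next rescale may run (illustrative).

    If the cooldown has already elapsed, the result is `now_s`, meaning
    "rescale immediately". Otherwise the rescale is deferred to
    last-rescale + scaling-interval.min.
    """
    earliest_allowed_s = last_rescale_s + scaling_interval_min_s
    return max(now_s, earliest_allowed_s)
```

For example, with a 60 s minimum interval and a last rescale at t=0, a slot change at t=30 yields a check scheduled for t=60, while a slot change at t=100 can be acted on right away.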
Related to this, there's the min-parallelism-increase option, that if for example set to "2" restricts rescale operations to only occur if the parallelism increases by at least 2.
yes I saw that in the code
Ideally however there would be a max timeout for this.

As such we could maybe think about this a bit differently:
Add 2 new options instead of 1:
jobmanager.adaptive-scheduler.scaling-interval.min: The minimum time the scheduler will wait for the next effective rescale operation.
jobmanager.adaptive-scheduler.scaling-interval.max: The maximum time the scheduler will wait for the next effective rescale operation.
At point 2, we said that when slots change (requirements change or new slots available), if the last rescale check (call to maybeRescale) was done less than scaling-interval.min ago, we should schedule a check that will rescale if min-parallelism-increase is met. Then, what is the use of the scaling-interval.max timeout in that context?
3) It sounds fine that we lose the cooldown state, because imo we want to reset the cooldown anyway on job failures (because a job failure inherently implies a potential rescaling).
exactly.
4) The stabilization time isn't really redundant and serves a different use-case. The idea behind it is that if a user adds multiple TMs at once then we don't want to rescale immediately at the first received slot. Without the stabilization time the cooldown would actually cause bad behavior here, because not only would we rescale immediately upon receiving the minimum required slots to scale up, but we also wouldn't use the remaining slots just because the cooldown says so.
I meant the opposite: not having only the cooldown but having only the stabilization time. I must have missed something because what I wonder is: if every rescale entails a restart of the pipeline and every restart entails passing through the waiting for resources state, then why introduce a cooldown when there is already at each rescale a stable resource timeout?
Best

Etienne
On 16/06/2023 15:47, Etienne Chauchot wrote:
Hi Robert,
Thanks for your feedback. I don't know the scheduler part well enough yet and I'm taking this ticket as a learning workshop.
Regarding your comments:
1. Taking a look at the AdaptiveScheduler class, which takes all its configuration from the JobManagerOptions, and also to be consistent with other parameter names, I'd suggest /jobmanager.scheduler-scaling-cooldown-period/
2. I thought scaling events existed already and the scheduler received them as mentioned in FLIP-160 (cf "Whenever the scheduler is in the Executing state and receives new slots") or in FLIP-138 (cf "Whenever new slots are available the SlotPool notifies the Scheduler"). If it is not the case (it is the scheduler who asks for slots), then there is no need for storing scaling requests indeed.
=> I need a confirmation here
3. If we lose the JobManager, we lose both the AdaptiveScheduler state and the CoolDownTimer state. So, upon recovery, it would be as if there was no ongoing coolDown period. So, a first re-scale could happen right away and it will start a coolDown period. A second re-scale would have to wait for the end of this period.
4. When a pipeline is re-scaled, it is restarted. Upon restart, the AdaptiveScheduler passes again through the "waiting for resources" state as FLIP-160 suggests. If so, then it seems that the coolDown period is kind of redundant with the resource-stabilization-timeout. I guess it is not the case, otherwise the FLINK-21883 ticket would not have been created.
=> I need a confirmation here also.


Thanks for your views on point 2 and 4.


Best

Etienne

On 15/06/2023 at 13:35, Robert Metzger wrote:
Thanks for the FLIP.

Some comments:
1. Can you specify the full proposed configuration name?
"scaling-cooldown-period" is probably not the full config name?
2. Why is the concept of scaling events and a scaling queue needed? If I
remember correctly, the adaptive scheduler will just check how many
TaskManagers are available and then adjust the execution graph accordingly.
There's no need to store a number of scaling events. We just need to
determine the time to trigger an adjustment of the execution graph.
3. What's the behavior wrt JobManager failures (e.g. we lose the state
of the Adaptive Scheduler)? My proposal would be to just reset the
cooldown period, so after recovery of a JobManager, we have to wait at
least for the cooldown period until further scaling operations are done.
4. What's the relationship to the
"jobmanager.adaptive-scheduler.resource-stabilization-timeout"
configuration?

Thanks a lot for working on this!

Best,
Robert
On Wed, Jun 14, 2023 at 3:38 PM Etienne Chauchot <[email protected]>
wrote:
Hi all,

@Yuxia, I updated the FLIP to include the aggregation of the stacked
operations that we discussed below. PTAL.

Best

Etienne


On 13/06/2023 at 16:31, Etienne Chauchot wrote:
Hi Yuxia,
Thanks for your feedback. The number of potentially stacked operations
depends on the configured length of the cooldown period.
The proposal in the FLIP is to add a minimum delay between 2 scaling
operations. But, indeed, an optimization could be to still stack the
operations (that arrive during a cooldown period) but maybe not take
only the last operation but rather aggregate them in order to end up
with a single aggregated operation when the cooldown period ends. For
example, let's say 3 taskManagers come up and 1 comes down during the
cooldown period, we could generate a single operation of scale up +2
when the period ends.
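
The aggregation idea above can be sketched as follows. This is only an illustration of the proposal as stated at this point in the thread: the function name is hypothetical, and each stacked operation is assumed to be representable as a signed change in the number of taskManagers.

```python
from typing import Iterable, Optional


def aggregate_scaling_operations(deltas: Iterable[int]) -> Optional[int]:
    """Collapse operations stacked during a cooldown into one net change.

    Each entry is +n for taskManagers coming up and -n for taskManagers
    going down. Returns the single net change to apply when the cooldown
    ends, or None when the changes cancel out and nothing needs to happen.
    """
    net = sum(deltas)
    return net if net != 0 else None


# 3 taskManagers come up and 1 comes down during the cooldown period.
net_change = aggregate_scaling_operations([+1, +1, +1, -1])
print(net_change)  # 2, i.e. a single "scale up +2" operation
```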

As a side note regarding your comment on "it'll take a long time to
finish all", please keep in mind that the reactive mode (at least for
now) is only available for streaming pipelines, which are in essence
infinite processing.
Another side note: when you mention "every taskManager connecting",
if you are referring to the start of the pipeline, please keep in mind
that the adaptive scheduler has a "waiting for resources" timeout
period before starting the pipeline in which all taskManagers connect
and the parallelism is decided.

Best

Etienne

On 13/06/2023 at 03:58, yuxia wrote:
Hi, Etienne. Thanks for driving it. I have one question about the
mechanism of the cooldown timeout.
From the Proposed Changes part, if a scaling event is received and
it falls during the cooldown period, it'll be stacked to be executed
after the period ends. Also, from the description of FLINK-21883 [1],
the cooldown timeout is to avoid rescaling the job very frequently,
because TaskManagers are not all connecting at the same time.
So, is it possible that every TaskManager connecting will produce a
scaling event, and that it'll be stacked with many scale up events, so
that it'll take a long time to finish them all? Can we just take the
last event?

[1]: https://issues.apache.org/jira/browse/FLINK-21883

Best regards, Yuxia
----- Original Message -----
From: "Etienne Chauchot" <[email protected]>
To: "dev" <[email protected]>, "Robert Metzger" <[email protected]>
Sent: Monday, June 12, 2023 11:34:25 PM
Subject: [DISCUSS] FLIP-322 Cooldown period for adaptive scheduler

Hi,
I’d like to start a discussion about FLIP-322 [1] which introduces a
cooldown period for the adaptive scheduler.

I'd like to get your feedback especially @Robert as you opened the
related ticket and worked on the reactive mode a lot.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-322+Cooldown+period+for+adaptive+scheduler
Best
Etienne
