Re: [DISCUSS] FLIP-322 Cooldown period for adaptive scheduler

Etienne Chauchot Thu, 06 Jul 2023 04:00:10 -0700

Hi,

I think we have reached a consensus here. I have updated the FLIP to reflect recent suggestions. I will start a new vote.


Best

Etienne

On 05/07/2023 at 14:42, Etienne Chauchot wrote:


Hi all,

Thanks David for your suggestions. Comments inline.

On 04/07/2023 at 13:35, David Morávek wrote:

waiting 2 min between 2 requirements push seems ok to me

This depends on the workload. Would you care if the cost of rescaling were
close to zero (which is for most out-of-the-box workloads)? In that case,
it would be desirable to rescale more frequently, for example, if TMs join
incrementally.

Creating a value that covers everything is impossible unless it's
self-tuning, so I'd prefer having a smooth experience for people trying
things out (just imagine doing a demo at the conference) and having them
opt-in for longer cooldowns.

The users still have the ability to lower the cooldown period for high workloads, but we could definitely set the default value to a lower number. I agree to favor lower numbers (for a smooth rescale experience) and to consider higher numbers (for high workloads) as exceptions. But we still need to agree on a suitable default for most cases: 30s?

One idea to keep the timeouts lower while getting more balance would be
restarting the cooldown period when new resources or requirements are
received. This would also bring the cooldown's behavior closer to the
resource-stabilization timeout. Would that make sense?

You mean, if slots are received during the cooldown period, instead of the proposed behavior (A), do behavior (B)?


A. schedule a rescale at lastRescale + cooldown point in time

B. schedule a rescale at ** now ** + cooldown point in time

It looks fine to me. It is even better because it avoids having 2 rescales scheduled at the same time if 2 slot changes arrive during the same cooldown period.
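The difference between (A) and (B) can be sketched as follows. This is only an illustration in Python, not Flink code: `CooldownPlanner`, its fields, and the example timestamps are hypothetical names and values chosen for the sketch.

```python
from dataclasses import dataclass


@dataclass
class CooldownPlanner:
    """Hypothetical helper (not a Flink class) that picks the next rescale time."""

    cooldown_s: float      # configured cooldown, in seconds
    last_rescale_s: float  # timestamp of the last rescale, in seconds

    def in_cooldown(self, now_s: float) -> bool:
        return now_s < self.last_rescale_s + self.cooldown_s

    def next_rescale_a(self, now_s: float) -> float:
        # Behavior (A): anchor on the last rescale.
        return self.last_rescale_s + self.cooldown_s

    def next_rescale_b(self, now_s: float) -> float:
        # Behavior (B): restart the cooldown from the moment slots arrive.
        return now_s + self.cooldown_s


planner = CooldownPlanner(cooldown_s=30.0, last_rescale_s=100.0)
arrivals = [110.0, 125.0]  # two slot changes inside the same cooldown period
assert all(planner.in_cooldown(t) for t in arrivals)

times_a = [planner.next_rescale_a(t) for t in arrivals]  # [130.0, 130.0]
times_b = [planner.next_rescale_b(t) for t in arrivals]  # [140.0, 155.0]
```

With (A), both slot changes map to the same instant (130 s); with (B), each change yields a distinct, later time (140 s, then 155 s).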



Etienne

Depends on how you implement it. If you ignore all of shouldRescale, yes,
but you shouldn't do that in the first place.



I agree, this is not what I planned to implement.

That sounds great; let's go ahead and outline this in the FLIP.

Best,
D.


On Tue, Jul 4, 2023 at 12:30 PM Etienne Chauchot <[email protected]> wrote:

Hi all,

Thanks David for your feedback. My comments are inline

On 04/07/2023 at 09:16, David Morávek wrote:

They will struggle if they add new resources and nothing happens for 5
minutes.

The same applies if they start playing with FLIP-291 APIs. I'm wondering
if the cooldown makes sense there since it was the user's deliberate
choice to push new requirements. 🤔

Sure, but remember that the initial rescale is always done immediately.
Only the time between 2 rescales is controlled by the cooldown period. I
don't see a user adding resources every 10s (your proposed default
value). Even with, let's say, 2 min, waiting 2 min between 2
requirements pushes seems ok to me.

Best,
D.

On Tue, Jul 4, 2023 at 9:11 AM David Morávek <[email protected]> wrote:

The FLIP reads sane to me. I'm unsure about the default values, though;
minutes of wait time between rescales feels rather strict, and we should
rethink it to provide a better out-of-the-box experience.

I'd focus on newcomers trying AS / Reactive Mode out. They will struggle
if they add new resources and nothing happens for 5 minutes. I'd suggest
defaulting to
*jobmanager.adaptive-scheduler.resource-stabilization-timeout* (which
defaults to 10s).

If users add resources, the rescale will happen right away. It is only
for subsequent additions that they will have to wait for the cooldown
period to end.

But anyway, we could lower the default value, I just took what Robert
suggested in the ticket.

I'm still struggling to grasp max interval (force rescale). Ignoring the
`AdaptiveScheduler#shouldRescale()` condition seems rather dangerous.
Wouldn't a simple case where you add a new TM and remove it before the
max interval is reached (so there is nothing to do) result in an
unnecessary job restart?

With the current behavior (on master): adding the TM will result in a
restart if the number of slots added leads to a job parallelism
increase of at least 2. Then removing it can have 2 consequences:
either it is removed before the resource-stabilization timeout and there
will be no restart, or it is removed after this timeout (the job is in
Running state) and it will entail another restart and a parallelism
decrease.

With the proposed behavior: scaling-interval.max only changes the
resource addition part. When the TM is added, if the time since the
last rescale > scaling-interval.max, then a restart and parallelism
increase will be done even if it leads to a parallelism increase < 2.
The rest regarding TM removal does not change.

=> So, the real difference with the current behavior is ** if the slot
addition was too small **: in the current behavior nothing happens. In
the new behavior nothing happens unless the addition arrives after
scaling-interval.max.
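The decision described above can be sketched as a small predicate. This is an assumption-laden illustration in Python, not the actual `AdaptiveScheduler#shouldRescale()` code: the function name, parameter names, and example values are hypothetical, and the cooldown (scaling-interval.min) is deliberately left out.

```python
def should_rescale_on_addition(
    parallelism_gain: int,
    min_parallelism_increase: int,
    since_last_rescale_s: float,
    scaling_interval_max_s: float,
) -> bool:
    """Hypothetical predicate for the slot-addition case only."""
    if parallelism_gain <= 0:
        # Nothing to gain, so never restart the job.
        return False
    if parallelism_gain >= min_parallelism_increase:
        # Current behavior: the gain is large enough on its own.
        return True
    # Proposed behavior: a too-small gain is still applied once the time
    # since the last rescale exceeds scaling-interval.max.
    return since_last_rescale_s > scaling_interval_max_s


# Gain of 1 with min-parallelism-increase = 2 and a 300 s max interval.
too_early = should_rescale_on_addition(1, 2, 60.0, 300.0)   # False
forced = should_rescale_on_addition(1, 2, 600.0, 300.0)     # True
```

A TM that is added and then removed before the check runs leaves a gain of 0, so under these assumptions the max interval alone does not trigger a restart.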


Best

Etienne

Best,
D.

On Thu, Jun 29, 2023 at 3:43 PM Etienne Chauchot <[email protected]> wrote:

Thanks Chesnay for your feedback. I have updated the FLIP. I'll start a
vote thread.

Best

Etienne

On 28/06/2023 at 11:49, Chesnay Schepler wrote:

we should schedule a check that will rescale if
min-parallelism-increase is met. Then, what is the use of the
scaling-interval.max timeout in that context?

To force a rescale if min-parallelism-increase is not met (but we
could still run above the current parallelism).

min-parallelism-increase is a trade-off between the cost of rescaling
vs the performance benefit of the parallelism increase. Over time the
balance tips more and more in favor of the parallelism increase, hence
we should eventually rescale anyway even if the minimum isn't met, or
at least give users the option to do so.

I meant the opposite: not having only the cooldown but having only
the stabilization time. I must have missed something because what I
wonder is: if every rescale entails a restart of the pipeline and
every restart entails passing through the Waiting for Resources state,
then why introduce a cooldown when there is already a stable resource
timeout at each rescale?

It is technically correct that the stable resource timeout can be used
to limit the number of rescale operations per interval, however during
that time the job isn't running, in contrast to the cooldown.

Having both just gives you a lot more flexibility.
"I want at most 1 rescale operation per hour, and wait at most 1
minute for resources to stabilize when a rescale happens".
You can't express this with only one of the options.

On 20/06/2023 14:41, Etienne Chauchot wrote:

Hi Chesnay,

Thanks for your feedback. Comments inline

On 16/06/2023 at 17:24, Chesnay Schepler wrote:

1) Options specific to the adaptive scheduler should start with
"jobmanager.adaptive-scheduler".

ok

2)
There isn't /really/ a notion of a "scaling event". The scheduler is
informed about new/lost slots and job failures, and reacts
accordingly by maybe rescaling the job.
(sure, you can think of these as events, but you can think of
practically everything as events)

There shouldn't be a queue for events. All the scheduler should have
to know is that the next rescale check is scheduled for time T,
which in practice boils down to a flag and a scheduled action that
runs Executing#maybeRescale.
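A minimal sketch of the "flag and a scheduled action" idea, written in Python for brevity rather than as Flink code. `RescaleCheckScheduler`, `on_slots_changed`, and the injected `schedule` callable are hypothetical names; only `maybeRescale` comes from the discussion.

```python
from typing import Callable, List, Tuple


class RescaleCheckScheduler:
    """Hypothetical sketch: one pending-check flag, no queue of events."""

    def __init__(
        self,
        schedule: Callable[[float, Callable[[], None]], None],
        maybe_rescale: Callable[[], None],
    ) -> None:
        self._schedule = schedule            # schedule(delay_s, action)
        self._maybe_rescale = maybe_rescale  # stands in for Executing#maybeRescale
        self._check_pending = False

    def on_slots_changed(self, delay_s: float) -> None:
        if self._check_pending:
            return  # a check is already scheduled for time T
        self._check_pending = True
        self._schedule(delay_s, self._run_check)

    def _run_check(self) -> None:
        self._check_pending = False
        self._maybe_rescale()


# A list stands in for a real timer so the sketch stays deterministic.
scheduled: List[Tuple[float, Callable[[], None]]] = []
calls: List[str] = []
checker = RescaleCheckScheduler(
    lambda delay, action: scheduled.append((delay, action)),
    lambda: calls.append("maybeRescale"),
)
checker.on_slots_changed(30.0)
checker.on_slots_changed(30.0)  # ignored: a check is already pending
scheduled[0][1]()               # the timer fires
```

Two notifications produce a single scheduled action, and the check runs once when it fires.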

Makes total sense, it's very simple like this. Thanks for the
clarification and the pointer. Having read the related FLIPs, I'll
look at the code now.

With that in mind, we also have to look at how we keep this state
around. Presumably it is scoped to the current state, such that the
cooldown is reset if a job fails.
Maybe we should add a separate ExecutingWithCooldown state; not sure
yet.

Yes, losing the cooldown state and resetting the cooldown upon failure
is what I suggested in point 3 of my previous email. I'm not sure about
a new state either; I'll figure it out after experimenting with the
code. I'll update the FLIP then.

It would be good to clarify whether this FLIP only attempts to cover
scale up operations, or also scale downs in case of slot losses.

When slots are lost, most of the time it is due to a TM loss, so
several slots should be lost at the same time but (hopefully)
only once. There should not be many scale downs in a row (but
cascading failures can still happen). I think we should just protect
against scale ups immediately following. For that, I think we
could just keep the current behavior of transitioning to the Restarting
state and then back to the Waiting for Resources state. This state will
protect us against scale ups immediately following a failure/restart.

We should also think about how it relates to the externalized
declarative resource management. Should we always rescale
immediately? Should we wait until the cooldown is over?

It relates to point 2, no? We should rescale immediately only if the
last rescale was done more than scaling-interval.min ago; otherwise we
schedule a rescale at last-rescale + scaling-interval.min.

Related to this, there's the min-parallelism-increase option, that
if for example set to "2" restricts rescale operations to only occur
if the parallelism increases by at least 2.

yes I saw that in the code

Ideally however there would be a max timeout for this.

As such we could maybe think about this a bit differently:
Add 2 new options instead of 1:
jobmanager.adaptive-scheduler.scaling-interval.min: The minimum time
the scheduler will wait for the next effective rescale operations.
jobmanager.adaptive-scheduler.scaling-interval.max: The maximum time
the scheduler will wait for the next effective rescale operations.

At point 2, we said that when slots change (requirements change or
new slots available), if the last rescale check (call to maybeRescale)
was done less than scaling-interval.min ago, we should schedule a
check that will rescale if min-parallelism-increase is met. Then,
what is the use of the scaling-interval.max timeout in that context?

3) It sounds fine that we lose the cooldown state, because imo we
want to reset the cooldown anyway on job failures (because a job
failure inherently implies a potential rescaling).

exactly.

4) The stabilization time isn't really redundant and serves a
different use-case. The idea behind it is that if a user adds multiple
TMs at once then we don't want to rescale immediately at the first
received slot. Without the stabilization time the cooldown would
actually cause bad behavior here, because not only would we rescale
immediately upon receiving the minimum required slots to scale up,
but we also wouldn't use the remaining slots just because the
cooldown says so.

I meant the opposite: not having only the cooldown but having only
the stabilization time. I must have missed something because what I
wonder is: if every rescale entails a restart of the pipeline and
every restart entails passing through the Waiting for Resources state,
then why introduce a cooldown when there is already a stable resource
timeout at each rescale?


Best

Etienne

On 16/06/2023 15:47, Etienne Chauchot wrote:

Hi Robert,

Thanks for your feedback. I don't know the scheduler part well
enough yet and I'm taking this ticket as a learning workshop.

Regarding your comments:

1. Taking a look at the AdaptiveScheduler class which takes all its
configuration from the JobManagerOptions, and also to be consistent
with other parameters name, I'd suggest
/jobmanager.scheduler-scaling-cooldown-period/

2. I thought scaling events existed already and the scheduler
received them as mentioned in FLIP-160 (cf "Whenever the scheduler
is in the Executing state and receives new slots") or in FLIP-138
(cf "Whenever new slots are available the SlotPool notifies the
Scheduler"). If it is not the case (it is the scheduler who asks
for slots), then there is indeed no need for storing scaling requests.

=> I need a confirmation here

3. If we lose the JobManager, we lose both the AdaptiveScheduler
state and the CoolDownTimer state. So, upon recovery, it would be
as if there was no ongoing cooldown period. So, a first rescale
could happen right away and it will start a cooldown period. A
second rescale would have to wait for the end of this period.

4. When a pipeline is re-scaled, it is restarted. Upon restart, the
AdaptiveScheduler passes again in the "waiting for resources" state
as FLIP-160 suggests. If so, then it seems that the coolDown period
is kind of redundant with the resource-stabilization-timeout. I
guess it is not the case otherwise the FLINK-21883 ticket would not
have been created.

=> I need a confirmation here also.


Thanks for your views on point 2 and 4.


Best

Etienne

On 15/06/2023 at 13:35, Robert Metzger wrote:

Thanks for the FLIP.

Some comments:
1. Can you specify the full proposed configuration name?
"scaling-cooldown-period" is probably not the full config name?
2. Why is the concept of scaling events and a scaling queue needed? If I
remember correctly, the adaptive scheduler will just check how many
TaskManagers are available and then adjust the execution graph
accordingly. There's no need to store a number of scaling events. We
just need to determine the time to trigger an adjustment of the
execution graph.
3. What's the behavior wrt JobManager failures (e.g. we lose the state
of the Adaptive Scheduler)? My proposal would be to just reset the
cooldown period, so after recovery of a JobManager, we have to wait at
least for the cooldown period until further scaling operations are
done.
4. What's the relationship to the
"jobmanager.adaptive-scheduler.resource-stabilization-timeout"
configuration?

Thanks a lot for working on this!

Best,
Robert

On Wed, Jun 14, 2023 at 3:38 PM Etienne Chauchot <[email protected]> wrote:

Hi all,

@Yuxia, I updated the FLIP to include the aggregation of the stacked
operations that we discussed below. PTAL.

Best

Etienne


On 13/06/2023 at 16:31, Etienne Chauchot wrote:

Hi Yuxia,

Thanks for your feedback. The number of potentially stacked operations
depends on the configured length of the cooldown period.



The proposition in the FLIP is to add a minimum delay between 2 scaling
operations. But, indeed, an optimization could be to still stack the
operations (that arrive during a cooldown period) but maybe not take
only the last operation but rather aggregate them in order to end up
with a single aggregated operation when the cooldown period ends. For
example, let's say 3 taskManagers come up and 1 comes down during the
cooldown period, we could generate a single operation of scale up +2
when the period ends.
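The aggregation idea from this message can be illustrated with a few lines of Python. This is only a sketch of the arithmetic: `aggregate_scaling_operations` and the +1/-1 encoding of TaskManager arrivals and departures are hypothetical, not something defined in the FLIP.

```python
from typing import Iterable


def aggregate_scaling_operations(deltas: Iterable[int]) -> int:
    """Hypothetical helper: collapse stacked +1/-1 changes into one net change."""
    return sum(deltas)


# 3 TaskManagers come up and 1 goes down during the cooldown period.
stacked = [+1, +1, +1, -1]
net_change = aggregate_scaling_operations(stacked)  # 2, i.e. one scale up of +2
```

Taking only the last operation instead would apply -1 here, which is why aggregating differs from keeping just the final event.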

As a side note regarding your comment on "it'll take a long time to
finish all", please keep in mind that the reactive mode (at least for
now) is only available for streaming pipelines, which are in essence
infinite processing.

Another side note: when you mention "every taskManagers connecting",
if you are referring to the start of the pipeline, please keep in mind
that the adaptive scheduler has a "waiting for resources" timeout
period before starting the pipeline, during which all taskmanagers
connect and the parallelism is decided.

Best

Etienne

On 13/06/2023 at 03:58, yuxia wrote:

Hi, Etienne. Thanks for driving it. I have one question about the
mechanism of the cooldown timeout.

From the Proposed Changes part, if a scaling event is received and
it falls during the cooldown period, it'll be stacked to be executed
after the period ends. Also, from the description of FLINK-21883[1],
the cooldown timeout is to avoid rescaling the job very frequently,
because TaskManagers are not all connecting at the same time.

So, is it possible that every taskmanager connecting will produce a
scaling event, and that it'll be stacked with many scale up events,
so that it'll take a long time to finish them all? Can we just take
the last event?

[1]:https://issues.apache.org/jira/browse/FLINK-21883

Best regards, Yuxia

----- Original Message -----
From: "Etienne Chauchot" <[email protected]>
To: "dev" <[email protected]>, "Robert Metzger" <[email protected]>
Sent: Monday, June 12, 2023, 11:34:25 PM
Subject: [DISCUSS] FLIP-322 Cooldown period for adaptive scheduler

Hi,

I’d like to start a discussion about FLIP-322 [1] which introduces a
cooldown period for the adaptive scheduler.

I'd like to get your feedback, especially @Robert as you opened the
related ticket and worked on the reactive mode a lot.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-322+Cooldown+period+for+adaptive+scheduler

Best

Etienne
