Hello,

First of all, thank you for tackling this topic; it will be a massive boon to Flink if it gets in.
Following up on JunRui Lee's question: have you considered having metrics collection triggered by events rather than periodic checks? For example, if input source lag has been increasing for the past x amount of time, trigger a metric collection to understand what, if anything, to scale. For Kubernetes workloads, KEDA does this: https://keda.sh/docs/2.8/scalers/prometheus/ (I have appended a rough sketch of what I mean below the quoted thread.)

My apologies if the question doesn't make sense.

Thank you for your time,
Pedro Silva

> On 5 Nov 2022, at 08:09, JunRui Lee <jrlee....@gmail.com> wrote:
>
> Hi Max,
>
> Thanks for writing this FLIP and initiating the discussion.
>
> I just have a small question after reading the FLIP:
>
> In the document, I didn't find a definition of when autoscaling is
> triggered after some JobVertex reaches the threshold. If I missed it,
> please let me know.
> IIUC, proper triggering rules are necessary to avoid unnecessary
> autoscaling caused by temporary large changes in data; otherwise it
> will lead to at least two meaningless resubmissions of the job, which
> will negatively affect users.
>
> Thanks,
> JunRui Lee
>
> On Sat, 5 Nov 2022 at 20:38, Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Hey!
>>
>> Thanks for the input!
>>
>> The algorithm does not really differentiate between scaling up or down,
>> as it is concerned with finding the right parallelism to match the
>> target processing rate with just enough spare capacity.
>>
>> Let me try to address your specific points:
>>
>> 1. The backlog growth rate only matters for computing the target
>> processing rate for the sources. If the parallelism is high enough and
>> there is no back pressure, it will be close to 0, so the target rate is
>> the source read rate. This is as intended. If we see that the sources
>> are not busy and can read more than enough, the algorithm will scale
>> them down.
>>
>> 2. You are right, it is dangerous to scale in too much, so we have
>> already thought about limiting the scale-down amount per scaling
>> step/time window to give more safety. But we can definitely think about
>> different strategies in the future!
>>
>> The observation regarding max parallelism is very important and we
>> should always take it into consideration.
>>
>> Cheers,
>> Gyula
>>
>>> On Sat, 5 Nov 2022 at 11:46, Biao Geng <biaoge...@gmail.com> wrote:
>>>
>>> Hi Max,
>>>
>>> Thanks a lot for the FLIP. It is an extremely attractive feature!
>>> Just some follow-up questions/thoughts after reading the FLIP:
>>>
>>> In the doc, the discussion of the "scaling out" strategy is thorough
>>> and convincing to me, but "scaling down" seems less discussed. I have
>>> 2 cents for this aspect:
>>>
>>> 1. For source parallelisms, if the user configures a much larger value
>>> than normal, there should be very few pending records, even though the
>>> parallelism could still be optimized. But IIUC, in the current
>>> algorithm we will not take action in this case, as the backlog growth
>>> rate is almost zero. Is that understanding right?
>>>
>>> 2. Compared with "scaling out", "scaling in" is usually more dangerous,
>>> as it is more likely to negatively affect downstream jobs. The min/max
>>> load bounds should be useful. I am wondering if it is possible to have
>>> a different strategy for "scaling in" to make it more conservative, or,
>>> more eagerly, to allow a custom autoscaling strategy (e.g. a time-based
>>> strategy).
>>> Another side thought is that to recover a job from a checkpoint/savepoint,
>>> the new parallelism cannot be larger than the max parallelism defined in
>>> the checkpoint (see
>>> https://github.com/apache/flink/blob/17a782c202c93343b8884cb52f4562f9c4ba593f/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L128).
>>> Not sure if this limit should be mentioned in the FLIP.
>>>
>>> Again, thanks for the great work and looking forward to using the Flink
>>> k8s operator with it!
>>>
>>> Best,
>>> Biao Geng
>>>
>>> From: Maximilian Michels <m...@apache.org>
>>> Date: Saturday, November 5, 2022 at 2:37 AM
>>> To: dev <dev@flink.apache.org>
>>> Cc: Gyula Fóra <gyula.f...@gmail.com>, Thomas Weise <t...@apache.org>,
>>> Marton Balassi <mbala...@apache.org>, Őrhidi Mátyás <matyas.orh...@gmail.com>
>>> Subject: [DISCUSS] FLIP-271: Autoscaling
>>>
>>> Hi,
>>>
>>> I would like to kick off the discussion on implementing autoscaling for
>>> Flink as part of the Flink Kubernetes operator. I've outlined an approach
>>> here which I find promising:
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
>>>
>>> I've been discussing this approach with some of the operator
>>> contributors: Gyula, Marton, Matyas, and Thomas (all in CC). We started
>>> prototyping an implementation based on the current FLIP design. If that
>>> goes well, we would like to contribute this to Flink based on the
>>> results of the discussion here.
>>>
>>> I'm curious to hear your thoughts.
>>>
>>> -Max
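
To make the event-triggered idea above a bit more concrete, here is a rough Java sketch of what I have in mind (the class and method names are made up for illustration and are not part of the operator or of KEDA): only kick off the full metric collection once the observed source lag has been growing for a configured amount of time, instead of collecting on a fixed schedule.

    import java.time.Duration;
    import java.time.Instant;

    /**
     * Hypothetical trigger: fires once the observed source lag has been
     * non-decreasing for at least the configured window, which could then
     * start a full metric collection / scaling evaluation.
     */
    public class LagGrowthTrigger {

        private final Duration window;
        private long lastLag = Long.MIN_VALUE;
        private Instant growingSince = null;

        public LagGrowthTrigger(Duration window) {
            this.window = window;
        }

        /** Record the latest lag sample; true means "collect metrics now". */
        public boolean onLagSample(Instant now, long lag) {
            if (lag < lastLag) {
                // Lag dropped, so reset the "growing" streak.
                growingSince = null;
            } else if (growingSince == null) {
                growingSince = now;
            }
            lastLag = lag;
            return growingSince != null
                    && Duration.between(growingSince, now).compareTo(window) >= 0;
        }
    }

KEDA's Prometheus scaler expresses essentially the same condition declaratively (a query plus a threshold), so the question is really whether the operator wants to evaluate such conditions itself or delegate them to an external system.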
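
Gyula describes the algorithm as finding the parallelism that matches the target processing rate with just enough spare capacity. The sketch below is only my reading of that description (the class, the parameter names and the exact formula are illustrative, not what the FLIP specifies); it also folds in two guards discussed in the thread, a floor on how far a single step may scale in and a cap at the max parallelism recorded in the checkpoint, which Biao points out cannot be exceeded on restore:

    /**
     * Back-of-the-envelope scaling decision, based on my reading of the
     * discussion above. Not the operator's actual API or formula.
     */
    public final class ScalingSketch {

        private ScalingSketch() {}

        /**
         * @param currentParallelism parallelism the vertex currently runs with
         * @param processingRate     records/s the vertex currently processes
         * @param backlogGrowthRate  records/s the backlog grows (~0 when keeping up)
         * @param busyTimeRatio      fraction of time the vertex is busy, in (0, 1]
         * @param targetUtilization  desired utilization after scaling, e.g. 0.8
         * @param maxParallelism     max parallelism recorded in the checkpoint
         * @param maxScaleDownFactor e.g. 0.5 = never drop below half the current parallelism per step
         */
        public static int suggestParallelism(
                int currentParallelism,
                double processingRate,
                double backlogGrowthRate,
                double busyTimeRatio,
                double targetUtilization,
                int maxParallelism,
                double maxScaleDownFactor) {

            // Target rate: what must be processed to keep up with the input.
            double targetRate = processingRate + Math.max(0.0, backlogGrowthRate);

            // Estimated capacity of one subtask if it were 100% busy.
            double capacityPerSubtask =
                    processingRate / Math.max(busyTimeRatio, 0.01) / currentParallelism;

            // Parallelism so that the target rate only needs targetUtilization of capacity.
            int desired = (int) Math.ceil(targetRate / (capacityPerSubtask * targetUtilization));

            // Be conservative when scaling in: limit the downward step size.
            if (desired < currentParallelism) {
                int scaleDownFloor = (int) Math.ceil(currentParallelism * maxScaleDownFactor);
                desired = Math.max(desired, scaleDownFloor);
            }

            // A restored job cannot exceed the checkpoint's max parallelism.
            return Math.max(1, Math.min(desired, maxParallelism));
        }
    }

Picking a divisor of the max parallelism could additionally keep key groups evenly spread across subtasks after rescaling, but that is an optimization beyond what is discussed here.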
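
On JunRui's point about temporary spikes causing needless resubmissions, one simple guard (again just an illustrative sketch, not something the FLIP prescribes) is to act only once several consecutive evaluations agree on the new parallelism:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.OptionalInt;

    /**
     * Debounces scaling decisions: a suggested parallelism is only released
     * after it has been produced by N consecutive evaluations, so a short
     * burst of data does not trigger a rescale that is undone right away.
     */
    public class ScalingDebouncer {

        private final int requiredConsecutive;
        private final Deque<Integer> recent = new ArrayDeque<>();

        public ScalingDebouncer(int requiredConsecutive) {
            if (requiredConsecutive < 1) {
                throw new IllegalArgumentException("requiredConsecutive must be >= 1");
            }
            this.requiredConsecutive = requiredConsecutive;
        }

        /** Returns the parallelism to apply, or empty to keep waiting. */
        public OptionalInt onSuggestion(int suggestedParallelism) {
            recent.addLast(suggestedParallelism);
            if (recent.size() > requiredConsecutive) {
                recent.removeFirst();
            }
            if (recent.size() < requiredConsecutive) {
                return OptionalInt.empty();
            }
            int first = recent.peekFirst();
            boolean stable = recent.stream().allMatch(p -> p == first);
            return stable ? OptionalInt.of(first) : OptionalInt.empty();
        }
    }

In practice a tolerance band would make more sense than exact equality, but the point is simply not to act on a single noisy evaluation.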