Re: [DISCUSS] FLIP-543: Support Customized Autoscale Algorithm

Peter Huang Wed, 03 Sep 2025 09:41:59 -0700

Hi Gyula,

Thanks for the feedback. I also discussed this offline with Diljeet. Diljeet
and I will review the diff in
https://github.com/apache/flink-kubernetes-operator/pull/953 to see whether
the implementation meets our requirements. We will work together on the
tasks scoped in FLIP-543.



Best Regards
Peter Huang

On Wed, Sep 3, 2025 at 1:39 AM Gyula Fóra <[email protected]> wrote:

> Hi Peter!
> Sounds like a good plan, it would be great if you could help review the
> PR/finalize the pluggable evaluator logic to make sure it fits your needs.
>
> Cheers,
> Gyula
>
> On Fri, Aug 29, 2025 at 12:17 AM Peter Huang <[email protected]>
> wrote:
>
> > Hi Folks,
> >
> > Thanks for these suggestions. I think we are aligned that these two
> > features are common needs and should be implemented upstream.
> > I have tried to summarize the action items below. Please feel free to add
> > more if I missed anything.
> >
> > 1) Finish the planned work in FLIP-514 to support a pluggable
> > MetricsEvaluator, and support the scheduled-scaling plugin as planned in
> > FLIP-514.
> > 2) Support predictive autoscaling as a configurable feature on top of a
> > customized MetricsEvaluator in FLIP-543.
> > 3) Support data-size-aware autoscaling as a configurable feature on top
> > of a customized MetricsEvaluator in FLIP-543.
> >
> > I will revise FLIP-543 to focus mainly on how predictive autoscaling and
> > data-size-aware autoscaling could be implemented on top of the pluggable
> > MetricsEvaluator.
> >
> > Best Regards
> > Peter Huang
> >
> > On Thu, Aug 28, 2025 at 2:18 AM Rui Fan <[email protected]> wrote:
> >
> > > Hi everyone,
> > >
> > > Thanks for the productive conversation on FLIP-543.
> > >
> > > I agree that we need more extensibility in the autoscaler. The
> > > predictive scaling use case is a perfect example of a powerful feature
> > > that would help many of us improve job availability by scaling before
> > > backlogs build up.
> > >
> > > To echo Gyula and Max's points, I also believe the best path forward is
> > > to build this capability as an extension to the existing framework, not
> > > as a replacement. This would offer a robust, community-driven solution
> > > for a common problem, which feels more sustainable than asking users to
> > > implement and maintain custom forks of the logic.
> > >
> > > Best,
> > > Rui
> > >
> > > On Thu, Aug 28, 2025 at 7:14 AM Pradeepta Choudhury
> > > <[email protected]> wrote:
> > >
> > > > Hello Peter,
> > > >
> > > > To start with, great initiative! But I echo the concern already raised
> > > > that creating too many extension points can compromise the autoscaler's
> > > > functionality.
> > > > When we proposed FLIP-514 [1] and a custom evaluator, the aim was
> > > > twofold: provide the required extension point and ship practical
> > > > strategies as pluggables. At the same time, we wanted to preserve
> > > > flexibility for advanced, highly specific scenarios (such as predictive
> > > > scaling) that differ by ecosystem, platform, and company. Our thinking
> > > > was that the custom evaluator strikes that balance: it lets users adjust
> > > > the evaluated metrics (especially TARGET_DATA_RATE) that drive the
> > > > scale-factor calculation, enabling useful out-of-the-box behavior
> > > > without constraining bespoke implementations.
> > > > One of the desired outcomes we had set for FLIP-514 was to ship a
> > > > scheduled-scaling strategy as a pluggable, leveraging a baseline period
> > > > and explicit scheduled windows to drive planned capacity changes. I've
> > > > been away since last month due to personal commitments. I plan to resume
> > > > after the first week of September and will complete the
> > > > scheduled-scaling plugin to wrap up the custom evaluator.
> > > > Having the ScalingRealizer pluggable
> > > > (https://github.com/apache/flink-kubernetes-operator/pull/1020/files)
> > > > definitely sounds helpful for certain scenarios.
> > > > But I totally agree with the general approach suggested by Gyula: solve
> > > > specific issues independently in the "best possible way" and then arrive
> > > > at a good solution for pluggability that could be a foundation for
> > > > future use cases.
> > > >
> > > >
> > > > Thanks and Regards
> > > > Pradeepta
> > > >
> > > >
> > > > > On 26 Aug 2025, at 6:05 PM, [email protected] <
> > > > [email protected]> wrote:
> > > > >
> > > > > > On the ScalingRealizer side, I think having before/after hooks for
> > > > > > `realizeParallelismOverrides` and `realizeConfigOverrides` would be
> > > > > > good. We could support these hooks from plugins. Thoughts?
> > > > >
> > > > >
> > > > > Best,
> > > > > Diljeet(DJ) Singh
> > > > >
> > > > > On 2025/08/26 08:24:33 Maximilian Michels wrote:
> > > > >> Hi Peter,
> > > > >>
> > > > > >> First of all, this is a great initiative. Flink Autoscaling
> > > > > >> definitely needs more points of extension. We recently added
> > > > > >> support for hooking into the metric evaluation (FLIP-514), but
> > > > > >> clearly that is just one extension point.
> > > > > >>
> > > > > >> That said, I think we will need to revise the approach a bit. I'm
> > > > > >> not sure we should be replacing core components. As Gyula
> > > > > >> mentioned, replacing those will easily break the entire
> > > > > >> autoscaler. Instead, we should be adding extension points which
> > > > > >> allow for meaningful additions without breaking the scaling
> > > > > >> logic. There is already the option to replace the entire
> > > > > >> autoscaling module, if users really want to roll out a completely
> > > > > >> custom version.
> > > > >>
> > > > >> What usually works best is to formulate the use case first, then
> > > > >> figure out what autoscaler customization would be necessary to
> > > > >> implement the use case.
> > > > >>
> > > > > >> As for making the ScalingRealizer pluggable
> > > > > >> (https://github.com/apache/flink-kubernetes-operator/pull/1020/files),
> > > > > >> I do think that makes sense for some scenarios.
> > > > >>
> > > > >> Cheers,
> > > > >> Max
> > > > >>
> > > > >> On Tue, Aug 26, 2025 at 8:59 AM Gyula Fóra <[email protected]>
> wrote:
> > > > >>>
> > > > >>> Hi Peter & Diljeet!
> > > > >>>
> > > > > >>> My general feedback is that we should try to introduce extension
> > > > > >>> plugins instead of plugins that completely replace key parts of
> > > > > >>> the autoscaler code.
> > > > >>>
> > > > > >>> Let me give you a concrete example through FLIP-514 and FLIP-543
> > > > > >>> using the MetricsEvaluator pluggability.
> > > > > >>> The MetricsEvaluator in the autoscaler is responsible for
> > > > > >>> evaluating/deriving/calculating metrics from the collected
> > > > > >>> metrics. It has to calculate everything in a more or less
> > > > > >>> specific way, otherwise other parts of the autoscaler that depend
> > > > > >>> on these metrics may not work. It doesn't seem very
> > > > > >>> practical/reasonable to completely reimplement this just because
> > > > > >>> someone wants to extend the logic. That is extremely error-prone
> > > > > >>> and fragile, especially if the autoscaler logic later evolves.
> > > > >>>
> > > > > >>> FLIP-514 takes the approach of extending the metric evaluator
> > > > > >>> with a new method that lets users modify the evaluated metrics at
> > > > > >>> the end and define custom ones. This is the right approach here,
> > > > > >>> as it makes a new extension very simple to build and maintain
> > > > > >>> without interfering with existing logic.
> > > > >>>
> > > > > >>> The approach in FLIP-543 and in Diljeet's example PR is to
> > > > > >>> completely substitute entire parts of the implementation (the
> > > > > >>> entire evaluator, scaling realizer, etc.). I think this is not
> > > > > >>> very good for either the community or the actual user. From a
> > > > > >>> community perspective, it makes it harder to extend the logic
> > > > > >>> with nice small additions, and from a user's perspective it is
> > > > > >>> very error-prone if the operator autoscaler logic changes, as it
> > > > > >>> basically exposes a lot of internal logic on a user interface.
> > > > >>>
> > > > > >>> So at this point, -1 for the approach in FLIP-543 from my side,
> > > > > >>> but I would love to hear the opinion of others as well.
> > > > >>>
> > > > >>> Cheers
> > > > >>> Gyula
> > > > >>>
> > > > >>> On Mon, Aug 25, 2025 at 11:44 PM Peter Huang <[email protected]>
> > > wrote:
> > > > >>>>
> > > > >>>> Hi Diljeet,
> > > > >>>>
> > > > > >>>> Yes, I think we have similar requirements: making the autoscaler
> > > > > >>>> even more powerful so it can handle some customized needs.
> > > > > >>>> The quick PoC makes sense to me. Let's get some more feedback
> > > > > >>>> from the community.
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> Best Regards
> > > > >>>> Peter Huang
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> On Mon, Aug 25, 2025 at 2:37 PM Peter Huang <[email protected]>
> > > > >>>> wrote:
> > > > >>>>
> > > > > >>>>> Just trying to combine the discussion into one thread.
> > > > > >>>>>
> > > > > >>>>> @Diljeet Singh posted a quick PoC for the proposal:
> > > > > >>>>> https://github.com/apache/flink-kubernetes-operator/pull/1020.
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On Mon, Aug 25, 2025 at 7:52 AM Peter Huang <[email protected]>
> > > > >>>>> wrote:
> > > > >>>>>
> > > > >>>>>> Hi Community,
> > > > >>>>>>
> > > > > >>>>>> Our org has been heavily using the Flink autoscaling
> > > > > >>>>>> algorithm. It has greatly reduced our operational overhead and
> > > > > >>>>>> improved cost efficiency, as users always over-provision
> > > > > >>>>>> resources when onboarding. Recently, we have had some
> > > > > >>>>>> requirements to customize the autoscaling algorithm for
> > > > > >>>>>> different scenarios: for example, handling a large but
> > > > > >>>>>> predictable traffic spike during the holiday season, or
> > > > > >>>>>> increasing the checkpoint interval together with a scale-up
> > > > > >>>>>> for streaming ingestion use cases.
> > > > >>>>>>
> > > > > >>>>>> We searched the mailing list for discussions on this topic,
> > > > > >>>>>> including the existing FLIP-514
> > > > > >>>>>> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-514%3A+Custom+Evaluator+plugin+for+Flink+Autoscaler>.
> > > > > >>>>>> It looks like that discussion is not finalized yet.
> > > > > >>>>>> To accelerate the process, we adopted and combined the
> > > > > >>>>>> existing opinions from the community and created a proposal in
> > > > > >>>>>> FLIP-543
> > > > > >>>>>> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-543%3A+Support+Customized+Autoscale+Algorithm>.
> > > > > >>>>>> The basic idea is to make some core components of the
> > > > > >>>>>> autoscaler pluggable (for example, MetricsCollector,
> > > > > >>>>>> MetricsEvaluator, and ScalingRealizer) while keeping the core
> > > > > >>>>>> logic skeleton of the autoscaler (which is already well proven
> > > > > >>>>>> across a large number of users) untouched.
> > > > >>>>>>
> > > > >>>>>> Looking forward to any feedback and opinions on FLIP-543.
> > > > >>>>>>
> > > > >>>>>> [1]
> > > > >>>>>>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-543%3A+Support+Customized+Autoscale+Algorithm
> > > > >>>>>> [2]
> > > > >>>>>>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-514%3A+Custom+Evaluator+plugin+for+Flink+Autoscaler
> > > > >>>>>> [3] Other related discussion thread
> > > > >>>>>>
> > > > >>>>>>
> > https://lists.apache.org/thread/749l74z1h5jylkxrw3rtjmxcj2t9p7ws
> > > > >>>>>>
> > > > >>>>>>
> > https://lists.apache.org/thread/mcd7jcn4kz6oqtyqq5hfycjf9mqh6c53
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> Best Regards
> > > > >>>>>> Peter Huang
> > > > >>>>>>
> > > > >>>>>
> > > >
> > > >
> > >
> >
>
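
For readers skimming the thread, the difference between an "extension" hook
and a "replacement" plugin, as Gyula describes it for FLIP-514, might be
sketched roughly as follows. This is only an illustration under stated
assumptions: it is written in Python for brevity (the operator itself is
Java), and every name in it (evaluate_core, evaluate, scheduled_floor,
observed_rate) is hypothetical and not taken from the operator's API. Only
the metric name TARGET_DATA_RATE comes from the discussion, and the way it
is derived here is a placeholder. The "floor during a scheduled window"
behavior is likewise just one assumed reading of scheduled scaling.

```python
from typing import Callable, Dict, Iterable

# Hypothetical types: the real evaluated metrics are richer than a flat dict.
Metrics = Dict[str, float]
Hook = Callable[[Metrics], Metrics]


def evaluate_core(collected: Metrics) -> Metrics:
    # Placeholder for the built-in evaluation. The real derivation of
    # TARGET_DATA_RATE is more involved and is NOT modeled here.
    return {"TARGET_DATA_RATE": collected["observed_rate"]}


def evaluate(collected: Metrics, hooks: Iterable[Hook]) -> Metrics:
    # The core evaluation always runs first; plugins cannot replace it.
    evaluated = evaluate_core(collected)
    for hook in hooks:
        # A hook receives a copy and returns only the entries it wants to
        # override or add; its result is merged on top of the core result.
        evaluated = {**evaluated, **hook(dict(evaluated))}
    return evaluated


def scheduled_floor(floor_rate: float) -> Hook:
    # Assumed example strategy: during a planned window, never let the
    # target rate drop below a configured floor.
    def hook(evaluated: Metrics) -> Metrics:
        return {"TARGET_DATA_RATE": max(evaluated["TARGET_DATA_RATE"], floor_rate)}
    return hook


if __name__ == "__main__":
    result = evaluate({"observed_rate": 1000.0}, [scheduled_floor(5000.0)])
    print(result["TARGET_DATA_RATE"])  # prints 5000.0
```

The point of the sketch is that a hook only adjusts the output of the
built-in evaluation, so later changes to the core logic do not require the
plugin to be rewritten, which is the maintainability argument made above.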
