Hi Gyula,

Thanks for the feedback. I also discussed this offline with Diljeet.
Diljeet and I will review the diff
https://github.com/apache/flink-kubernetes-operator/pull/953 to see
whether the implementation meets our requirements. We will work together
on the tasks scoped in FLIP-543.

Best Regards
Peter Huang

On Wed, Sep 3, 2025 at 1:39 AM Gyula Fóra <[email protected]> wrote:

> Hi Peter!
>
> Sounds like a good plan, it would be great if you could help review the
> PR/finalize the pluggable evaluator logic to make sure it fits your
> needs.
>
> Cheers,
> Gyula
>
> On Fri, Aug 29, 2025 at 12:17 AM Peter Huang <[email protected]> wrote:
>
> > Hi Folks,
> >
> > Thanks for these suggestions. I think we are aligned that these two
> > features are common and should be implemented upstream.
> > I will try to summarize the action items (AIs) below. Please feel
> > free to add more if I missed anything.
> >
> > 1) Finish the planned work in FLIP-514 to support a pluggable
> > MetricsEvaluator. Support the scheduled-scaling plugin as planned in
> > FLIP-514.
> > 2) Support Predictive Autoscaling as a configurable feature on top
> > of a customized MetricsEvaluator in FLIP-543.
> > 3) Support Data size aware autoscaling as a configurable feature on
> > top of a customized MetricsEvaluator in FLIP-543.
> >
> > I will revise FLIP-543 to focus mainly on how Predictive Autoscaling
> > and Data size aware autoscaling could be implemented on top of the
> > pluggable MetricsEvaluator.
> >
> > Best Regards
> > Peter Huang
> >
> > On Thu, Aug 28, 2025 at 2:18 AM Rui Fan <[email protected]> wrote:
> >
> > > Hi everyone,
> > >
> > > Thanks for the productive conversation on FLIP-543.
> > >
> > > I agree that we need more extensibility in the autoscaler. The
> > > predictive scaling use case is a perfect example of a powerful
> > > feature that would help many of us improve job availability by
> > > scaling before backlogs build up.
> > >
> > > To echo Gyula and Max's points, I also believe the best path
> > > forward is to build this capability as an extension to the
> > > existing framework, not as a replacement.
> > > This would offer a robust, community-driven solution for a common
> > > problem, which feels more sustainable than asking users to
> > > implement and maintain custom forks of the logic.
> > >
> > > Best,
> > > Rui
> > >
> > > On Thu, Aug 28, 2025 at 7:14 AM Pradeepta Choudhury
> > > <[email protected]> wrote:
> > >
> > > > Hello Peter,
> > > >
> > > > To start with, great initiative! But I echo the concern already
> > > > raised that creating too many extension points can compromise
> > > > the autoscaler functionality.
> > > > When we proposed FLIP-514 [1] and a custom evaluator, the aim
> > > > was twofold: provide the required extension point and ship
> > > > practical strategies as pluggables. At the same time, we wanted
> > > > to preserve flexibility for advanced, highly specific scenarios,
> > > > like predictive scaling, that differ by ecosystem, platform, and
> > > > company. The thinking was that the custom evaluator strikes that
> > > > balance: it lets users adjust the evaluated metrics, especially
> > > > TARGET_DATA_RATE, that drive the scale-factor calculation,
> > > > enabling useful out-of-the-box behavior without constraining
> > > > bespoke implementations.
> > > > One of the desired outcomes we had set for FLIP-514 was to ship
> > > > a scheduled-scaling strategy as a pluggable, leveraging a
> > > > baseline period and explicit scheduled windows to drive planned
> > > > capacity changes. I've been away since last month due to
> > > > personal commitments. I plan to resume after the first week of
> > > > September and will complete the scheduled-scaling plugin to wrap
> > > > up the custom evaluator.
> > > > Having the ScalingRealizer pluggable (
> > > > https://github.com/apache/flink-kubernetes-operator/pull/1020/files
> > > > ) definitely sounds helpful for certain scenarios.
> > > > But I totally agree with the general approach suggested by
> > > > Gyula, about solving specific issues independently in the "best
> > > > possible way" and then coming to a good solution regarding
> > > > pluggability that could be a foundation for future use cases.
> > > >
> > > >
> > > > Thanks and Regards
> > > > Pradeepta
> > > >
> > > >
> > > > > On 26 Aug 2025, at 6:05 PM, [email protected]
> > > > > <[email protected]> wrote:
> > > > >
> > > > > From the ScalingRealizer, I think having before/after hooks
> > > > > for `realizeParallelismOverrides` and `realizeConfigOverrides`
> > > > > would be good. We can support these hooks from plugins,
> > > > > thoughts?
> > > > >
> > > > >
> > > > > Best,
> > > > > Diljeet(DJ) Singh
> > > > >
> > > > > On 2025/08/26 08:24:33 Maximilian Michels wrote:
> > > > >> Hi Peter,
> > > > >>
> > > > >> First of all, this is a great initiative. Flink Autoscaling
> > > > >> definitely needs more points of extension. We recently added
> > > > >> support for hooking into the metric evaluation (FLIP-514),
> > > > >> but clearly that is just one extension point.
> > > > >>
> > > > >> That said, I think we will need to revise the approach a bit.
> > > > >> I'm not sure we should be replacing core components. As Gyula
> > > > >> mentioned, replacing those will easily break the entire
> > > > >> autoscaler. Instead, we should be adding extension points
> > > > >> which allow for meaningful additions without breaking the
> > > > >> scaling logic. There is already the option to replace the
> > > > >> entire autoscaling module, if users really want to roll out a
> > > > >> completely custom version.
> > > > >>
> > > > >> What usually works best is to formulate the use case first,
> > > > >> then figure out what autoscaler customization would be
> > > > >> necessary to implement the use case.
> > > > >>
> > > > >> As for making the ScalingRealizer pluggable
> > > > >> (https://github.com/apache/flink-kubernetes-operator/pull/1020/files),
> > > > >> I do think that makes sense for some scenarios.
> > > > >>
> > > > >> Cheers,
> > > > >> Max
> > > > >>
> > > > >> On Tue, Aug 26, 2025 at 8:59 AM Gyula Fóra <[email protected]> wrote:
> > > > >>>
> > > > >>> Hi Peter & Diljeet!
> > > > >>>
> > > > >>> My general feedback is that we should try to introduce
> > > > >>> extension plugins instead of plugins that completely replace
> > > > >>> key parts of the autoscaler code.
> > > > >>>
> > > > >>> Let me give you a concrete example through FLIP-514 and
> > > > >>> FLIP-543 using the MetricsEvaluator pluggability.
> > > > >>> The MetricsEvaluator in the autoscaler is responsible for
> > > > >>> evaluating/deriving/calculating metrics from the collected
> > > > >>> metrics. It has to calculate everything in a more or less
> > > > >>> specific way, otherwise other parts of the autoscaler that
> > > > >>> depend on these metrics may not work. It doesn't seem very
> > > > >>> practical/reasonable to completely reimplement this just
> > > > >>> because someone wants to extend the logic; this is extremely
> > > > >>> error-prone and fragile, especially if the autoscaler logic
> > > > >>> later evolves.
> > > > >>>
> > > > >>> FLIP-514 takes the approach of extending the metric evaluator
> > > > >>> with a new method that allows users to modify the evaluated
> > > > >>> metrics at the end and define custom ones. This is the right
> > > > >>> approach here as it makes a new extension very simple to
> > > > >>> build and maintain without interfering with existing logic.
> > > > >>>
> > > > >>> The approach in FLIP-543 and in Diljeet's example PR takes
> > > > >>> the replacement approach to completely substitute entire
> > > > >>> parts of the implementation (the entire evaluator, scaling
> > > > >>> realizer etc). I think this is not very good for either the
> > > > >>> community or the actual user.
> > > > >>> From a community perspective it makes it harder to extend the
> > > > >>> logic with nice small additions, and from a user's
> > > > >>> perspective it is very error-prone if the operator autoscaler
> > > > >>> logic changes, as it basically exposes a lot of internal
> > > > >>> logic on a user interface.
> > > > >>>
> > > > >>> So at this point, -1 for the approach in FLIP-543 from my
> > > > >>> side, but I would love to hear the opinion of others as well.
> > > > >>>
> > > > >>> Cheers
> > > > >>> Gyula
> > > > >>>
> > > > >>> On Mon, Aug 25, 2025 at 11:44 PM Peter Huang <[email protected]> wrote:
> > > > >>>>
> > > > >>>> Hi Diljeet,
> > > > >>>>
> > > > >>>> Yes, I think we have similar requirements to make the
> > > > >>>> autoscaler even more powerful to handle some customized
> > > > >>>> requirements.
> > > > >>>> The quick PoC makes sense to me. Let's get some more
> > > > >>>> feedback from the community.
> > > > >>>>
> > > > >>>>
> > > > >>>> Best Regards
> > > > >>>> Peter Huang
> > > > >>>>
> > > > >>>>
> > > > >>>> On Mon, Aug 25, 2025 at 2:37 PM Peter Huang <[email protected]>
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>>> Just trying to combine the discussion into one thread.
> > > > >>>>>
> > > > >>>>> @Diljeet Singh
> > > > >>>>> Posted a quick PoC for the proposal
> > > > >>>>> https://github.com/apache/flink-kubernetes-operator/pull/1020.
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On Mon, Aug 25, 2025 at 7:52 AM Peter Huang <[email protected]>
> > > > >>>>> wrote:
> > > > >>>>>
> > > > >>>>>> Hi Community,
> > > > >>>>>>
> > > > >>>>>> Our org has been heavily using the Flink autoscaling
> > > > >>>>>> algorithm. It greatly reduced our operation overhead and
> > > > >>>>>> improved cost efficiency, as users always over-provision
> > > > >>>>>> resources when onboarding.
> > > > >>>>>> Recently, we have had some requirements to customize the
> > > > >>>>>> autoscaling algorithm for different scenarios, for
> > > > >>>>>> example, handling large but predictable traffic spikes
> > > > >>>>>> during the holiday season, or increasing the checkpoint
> > > > >>>>>> interval together with scale-up for streaming ingestion
> > > > >>>>>> use cases.
> > > > >>>>>>
> > > > >>>>>> We searched through the discussion about the topic on the
> > > > >>>>>> mailing list, including the existing FLIP-514
> > > > >>>>>> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-514%3A+Custom+Evaluator+plugin+for+Flink+Autoscaler>.
> > > > >>>>>> It looks like the discussion is not finalized yet.
> > > > >>>>>> To accelerate the process, we adopted and combined the
> > > > >>>>>> existing opinions from the community and created a
> > > > >>>>>> proposal in FLIP-543
> > > > >>>>>> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-543%3A+Support+Customized+Autoscale+Algorithm>.
> > > > >>>>>> The basic idea is to make some core components of the
> > > > >>>>>> autoscaler pluggable, for example, MetricsCollector,
> > > > >>>>>> MetricsEvaluator, and ScalingRealizer, while at the same
> > > > >>>>>> time keeping the core logic skeleton of the autoscaler
> > > > >>>>>> (which is already well validated by a large number of
> > > > >>>>>> users) untouched.
> > > > >>>>>>
> > > > >>>>>> Looking forward to any feedback and opinions on FLIP-543.
> > > > >>>>>>
> > > > >>>>>> [1]
> > > > >>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-543%3A+Support+Customized+Autoscale+Algorithm
> > > > >>>>>> [2]
> > > > >>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-514%3A+Custom+Evaluator+plugin+for+Flink+Autoscaler
> > > > >>>>>> [3] Other related discussion thread
> > > > >>>>>>
> > > > >>>>>> https://lists.apache.org/thread/749l74z1h5jylkxrw3rtjmxcj2t9p7ws
> > > > >>>>>>
> > > > >>>>>> https://lists.apache.org/thread/mcd7jcn4kz6oqtyqq5hfycjf9mqh6c53
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> Best Regards
> > > > >>>>>> Peter Huang
> > > > >>>>>>
