After discussing this FLIP-334[1] offline with Gyula and Max, I updated the FLIP based on the latest conclusion.
Big thanks to Gyula and Max for their professional advice! > Does the interface function of handlerRecommendedParallelism > in AutoScalerEventHandler conflict with > handlerScalingFailure/handlerScalingReport (one of the > handles the event of scale failure, and the other handles > the event of scale success). Hi Matt, You can take a look at the FLIP, I think the issue has been fixed. Currently, we introduced the ScalingRealizer and AutoScalerEventHandler interface. The ScalingRealizer handles scaling action. - The AutoScalerEventHandler interface handles loggable events. Looking forward to your feedback, thanks! [1] https://cwiki.apache.org/confluence/x/x4qzDw Best, Rui On Thu, Aug 24, 2023 at 10:55 AM Matt Wang <wang...@163.com> wrote: > Sorry for the late reply, I still have a small question here: > Does the interface function of handlerRecommendedParallelism > in AutoScalerEventHandler conflict with > handlerScalingFailure/handlerScalingReport (one of the > handles the event of scale failure, and the other handles > the event of scale success). > > > > -- > > Best, > Matt Wang > > > ---- Replied Message ---- > | From | Rui Fan<1996fan...@gmail.com> | > | Date | 08/21/2023 17:41 | > | To | <dev@flink.apache.org> | > | Cc | Maximilian Michels<m...@apache.org> , > Gyula Fóra<gyula.f...@gmail.com> , > Matt Wang<wang...@163.com> | > | Subject | Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes | > Hi Max, Gyula and Matt, > > Do you have any other comments? > > The flink-kubernetes-operator 1.6 has been released recently, > it's a good time to kick off this FLIP. > > Please let me know if you have any questions or concerns, > looking forward to your feedback, thanks! > > Best, > Rui > > On Wed, Aug 9, 2023 at 11:55 AM Rui Fan <1996fan...@gmail.com> wrote: > > Hi Matt Wang, > > Thanks for your discussion here. > > it is recommended to unify the descriptions of AutoScalerHandler > and AutoScalerEventHandler in the FLIP > > Good catch, I have updated all AutoScalerHandler to > AutoScalerEventHandler. > > Can it support the use of zookeeper (zookeeper is a relatively > common use of flink HA)? > > In my opinion, it's a good suggestion. However, I prefer we > implement other state stores in the other FLINK JIRA, and > this FLIP focus on the decoupling and implementing the > necessary state store. Does that make sense? > > Regarding each scaling information, can it be persisted in > the shared file system through the filesystem? I think it will > be a more valuable requirement to support viewing > Autoscaling info on the UI in the future, which can provide > some foundations in advance; > > This is a good suggestion as well. It's useful for users to check > the scaling information. I propose to add a CompositeEventHandler, > it can include multiple EventHandlers. > > However, as the last question, I prefer we implement other > event handler in the other FLINK JIRA. What do you think? > > A solution mentioned in FLIP is to initialize the > AutoScalerEventHandler object every time an event is > processed. > > No, the FLIP mentioned `The AutoScalerEventHandler object is shared for > all flink jobs`, > So the AutoScalerEventHandler is only initialized once. > > And we call the AutoScalerEventHandler#handlerXXX > every time an event is processed. > > Best, > Rui > > On Tue, Aug 8, 2023 at 9:40 PM Matt Wang <wang...@163.com> wrote: > > Hi Rui > > Thanks for driving the FLIP. > > I agree with the point fo this FLIP. This FLIP first provides a > general function of Autoscaler in Flink repo, and there is no > need to move kubernetes-autoscaler from kubernetes-operator > to Flink repo in this FLIP(it is recommended to unify the > descriptions of AutoScalerHandler and AutoScalerEventHandler > in the FLIP). Here I still have a few questions: > > 1. AutoScalerStateStore mainly records the state information > during Scaling. In addition to supporting the use of configmap, > can it support the use of zookeeper (zookeeper is a relatively > common use of flink HA)? > 2. Regarding each scaling information, can it be persisted in > the shared file system through the filesystem? I think it will > be a more valuable requirement to support viewing > Autoscaling info on the UI in the future, which can provide > some foundations in advance; > 3. A solution mentioned in FLIP is to initialize the > AutoScalerEventHandler object every time an event is > processed. What is the main purpose of this solution? > > > > -- > > Best, > Matt Wang > > > ---- Replied Message ---- > | From | Rui Fan<1996fan...@gmail.com> | > | Date | 08/7/2023 11:34 | > | To | <dev@flink.apache.org> | > | Cc | m...@apache.org<m...@apache.org> , > Gyula Fóra<gyula.f...@gmail.com> | > | Subject | Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes > | > Hi Ron: > > Thanks for the feedback! The goal is indeed to turn the autoscaler into > a general tool that can support other resource management. > > > Hi Max, Gyula: > > My proposed `AutoScalerStateStore` is similar to Map, it can really be > improved. > > public interface AutoScalerStateStore { > Map<String, String> getState(KEY jobKey) > void updateState(KEY jobKey, Map<String, String> state); > } > > From the method parameter, the StateStore is shared by all jobs, right? > If yes, the `KEY jobKey` isn't enough because the CR is needed during > creating the ConfigMap. The `jobKey` is ResourceID, the extraJobInfo is > CR. > > So, this parameter may need to be changed from `KEY jobKey` to > `JobAutoScalerContext<KEY, INFO>`. Does that make sense? > If yes, I can update the interface in the FLIP doc. > > Best, > Rui > > On Mon, Aug 7, 2023 at 10:18 AM liu ron <ron9....@gmail.com> wrote: > > Hi, Rui > > Thanks for driving the FLIP. > > The tuning of streaming jobs by autoscaler is very important. Although the > mainstream trend now is cloud-native, many companies still run their Flink > jobs on Yarn for historical reasons. If we can decouple autoscaler from > K8S > and turn it into a common tool that can support other resource management > frameworks such as Yarn, I think it will be very meaningful. > +1 for this proposal. > > Best, > Ron > > > Gyula Fóra <gyula.f...@gmail.com> 于2023年8月5日周六 15:03写道: > > Hi Rui! > > Thanks for the proposal. > > I agree with Max on that the state store abstractions could be improved > and > be more specific as we know what goes into the state. It could simply be > > public interface AutoScalerStateStore { > Map<String, String> getState(KEY jobKey) > void updateState(KEY jobKey, Map<String, String> state); > } > > > We could also remove the entire recommended parallelism logic from the > interface and make it internal to the implementation somehow because it's > not very nice in the current form. > > Cheers, > Gyula > > On Fri, Aug 4, 2023 at 7:05 AM Rui Fan <1996fan...@gmail.com> wrote: > > Hi Max, > > After careful consideration, I prefer to keep the AutoScalerStateStore > instead of AutoScalerEventHandler taking over the work of > AutoScalerStateStore. And the following are some reasons: > > 1. Keeping the AutoScalerStateStore to make StateStore easy to plug in. > > Currently, the kubernetes-operator-autoscaler uses the ConfigMap as the > state store. However, users may use a different state store for > yarn-autoscaler or generic autoscaler. Such as: MySQL StateStore, > Heaped StateStore or PostgreSQL StateStore, etc. > > Of course, kubernetes autoscaler can also use the MySQL StateStore. > If the AutoScalerEventHandler is responsible for recording events, > scaling job and accessing state, whenever users or community want to > create a new state store, they must also implement the new > AutoScalerEventHandler, it includes recording events and scaling job. > > If we decouple AutoScalerEventHandler and AutoScalerStateStore, > it's easy to implement a new state store. > > 2. AutoScalerEventHandler isn't suitable for access state. > > Sometimes the generic autoscaler needs to query state, > AutoScalerEventHandler can update the state during handling events. > However, it's wired to provide a series of interfaces to query state. > > What do you think? > > And looking forward to more thoughts from the community, thanks! > > Best, > Rui Fan > > On Tue, Aug 1, 2023 at 11:47 PM Rui Fan <1996fan...@gmail.com> wrote: > > Hi Max, > > Thanks for your quick response! > > 1. Handle state in the AutoScalerEventHandler which will receive > all related scaling and metric collection events, and can keep > track of any state. > > If I understand correctly, you mean that updating state is just part > of > handling events, right? > > If yes, sounds make sense. However, I have some concerns: > > - Currently, we have 3 key-values that need to be stored. And the > autoscaler needs to get them first, then update them, and sometimes > remove them. If we use AutoScalerEventHandler, we need to provided > 9 methods, right? Every key has 3 methods. > - Do we add the persistState interface for AutoScalerEventHandler to > persist in-memory state to kubernetes? > > 2. In the long run, the autoscaling logic can move to a > separate repository, although this will complicate the release > process, so I would defer this unless there is strong interest. > > I also agree to leave it in flink-k8s-operator for now. Unless moving > it > out of flink-k8s-operator is necessary in the future. > > 3. The proposal mentions some removal of tests. > > Sorry, I didn't express clearly in FLIP. POC just check whether these > interfaces work well. It will take more time if I develop all the > tests > during POC. So I removed these tests in my POC. > > These tests will be completed in the final PR, and the test is very > useful > for less bugs. > > Best, > Rui Fan > > On Tue, Aug 1, 2023 at 10:10 PM Maximilian Michels <m...@apache.org> > wrote: > > Hi Rui, > > Thanks for the proposal. I think it makes a lot of sense to decouple > the autoscaler from Kubernetes-related dependencies. A couple of > notes > when I read the proposal: > > 1. You propose AutoScalerEventHandler, AutoScalerStateStore, > AutoScalerStateStoreFactory, and AutoScalerEventHandler. > AutoscalerStateStore is a generic key/value database (methods: > "get"/"put"/"delete"). I would propose to refine this interface and > make it less general purpose, e.g. add a method for persisting > scaling > decisions as well as any metrics gathered for the current metric > window. For simplicity, I'd even go so far to remove the state store > entirely, but rather handle state in the AutoScalerEventHandler > which > will receive all related scaling and metric collection events, and > can > keep track of any state. > > 2. You propose to make the current autoscaler module > Kubernetes-agnostic by moving the Kubernetes parts into the main > operator module. I think that makes sense since the Kubernetes > implementation will continue to be tightly coupled with Kubernetes. > The goal of the separate module was to make the autoscaler logic > pluggable, but this will continue to be possible with the new > "flink-autoscaler" module which contains the autoscaling logic and > interfaces. In the long run, the autoscaling logic can move to a > separate repository, although this will complicate the release > process, so I would defer this unless there is strong interest. > > 3. The proposal mentions some removal of tests. It is critical for > us > that all test coverage of the current implementation remains active. > It is ok if some of the test coverage only covers the Kubernetes > implementation. We can eventually move more tests without Kubernetes > significance into the implementation-agnostic autoscaler tests. > > -Max > > On Tue, Aug 1, 2023 at 9:46 AM Rui Fan <fan...@apache.org> wrote: > > Hi all, > > I and Samrat(cc'ed) created the FLIP-334[1] to decoupling the > autoscaler > and kubernetes. > > Currently, the flink-autoscaler is tightly integrated with > Kubernetes. > There are compelling reasons to extend the use of flink-autoscaler > to > more types of Flink jobs: > 1. With the recent merge of the Externalized Declarative Resource > Management (FLIP-291[2]), in-place scaling is now supported > across all types of Flink jobs. This development has made scaling > Flink > on > YARN a straightforward process. > 2. Several discussions[3] within the Flink user community, as > observed > in > the mail list , have emphasized the necessity of flink-autoscaler > supporting > Flink on YARN. > > Please refer to the FLIP[1] document for more details about the > proposed > design and implementation. We welcome any feedback and opinions on > this proposal. > > [1] https://cwiki.apache.org/confluence/x/x4qzDw > [2] > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management > [3] > https://lists.apache.org/thread/pr0r8hq8kqpzk3q1zrzkl3rp1lz24v7v > > > > > > >