Hi Weihua, Thanks for driving this. As Xintong mentioned, this was a technical debt from FLIP-56.
The latest version of FLIP sounds good, +1 from my side. As a contributor to this component, I'm willing to assist with the review process. Feel free to reach me if you need help. Best, Yangze Guo On Tue, Mar 7, 2023 at 1:47 PM Weihua Hu <huweihua....@gmail.com> wrote: > > Hi, > > @David @Matthias > There are a few days after hearing your thoughts. I would like to know if > there are any other concerns about this FLIP. > > > Best, > Weihua > > > On Mon, Mar 6, 2023 at 7:53 PM Weihua Hu <huweihua....@gmail.com> wrote: > > > > > Thanks Shammon, > > > > I've updated FLIP to add this redundant Task Manager limitation. > > > > > > Best, > > Weihua > > > > > > On Mon, Mar 6, 2023 at 5:07 PM Shammon FY <zjur...@gmail.com> wrote: > > > >> Hi weihua > >> > >> Can you add content related to `heterogeneous resources` to this FLIP? We > >> can record it and consider it in the future. It may be useful for some > >> scenarios, such as the combination of streaming and ML. > >> > >> Best, > >> Shammon > >> > >> > >> On Mon, Mar 6, 2023 at 1:39 PM weijie guo <guoweijieres...@gmail.com> > >> wrote: > >> > >> > Hi Weihua, > >> > > >> > Thanks for your clarification, SGTM. > >> > > >> > Best regards, > >> > > >> > Weijie > >> > > >> > > >> > Weihua Hu <huweihua....@gmail.com> 于2023年3月6日周一 11:43写道: > >> > > >> > > Thanks Weijie. > >> > > > >> > > Heterogeneous task managers will not be considered in this FLIP since > >> > > it does not request heterogeneous resources as you said. > >> > > > >> > > My first thought is we can adjust the meaning of redundant > >> configuration > >> > > to redundant number of per resource type. These can be considered in > >> > > detail when we decide to support heterogeneous task managers. > >> > > > >> > > Best, > >> > > Weihua > >> > > > >> > > > >> > > On Sat, Mar 4, 2023 at 1:13 AM weijie guo <guoweijieres...@gmail.com> > >> > > wrote: > >> > > > >> > > > Thanks Weihua for preparing this FLIP. > >> > > > > >> > > > This FLIP overall looks reasonable to me after updating as > >> suggested by > >> > > > Matthias. > >> > > > > >> > > > I only have one small question about keeping some redundant task > >> > > managers: > >> > > > In the fine-grained resource management, theoretically, it can > >> support > >> > > > heterogeneous taskmanagers. When we complete the missing features > >> for > >> > > FGSM, > >> > > > do we plan to take this into account? > >> > > > Of course, if I remember correctly, FGSM will not request > >> heterogeneous > >> > > > resources at present, so it is also acceptable to me if there is no > >> > > special > >> > > > treatment now. > >> > > > > >> > > > +1 for this changes if we can ensure the test coverage. > >> > > > > >> > > > Best regards, > >> > > > > >> > > > Weijie > >> > > > > >> > > > > >> > > > John Roesler <vvcep...@apache.org> 于2023年3月2日周四 12:53写道: > >> > > > > >> > > > > Thanks for the test plan, Weihua! > >> > > > > > >> > > > > Yes, it addresses my concerns. > >> > > > > > >> > > > > Thanks, > >> > > > > John > >> > > > > > >> > > > > On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote: > >> > > > > > Hi, everyone, > >> > > > > > Thanks for your suggestions and ideas. > >> > > > > > Thanks Xintong for sharing the detailed backgrounds of > >> SlotManager. > >> > > > > > > >> > > > > > *@Matthias > >> > > > > > > >> > > > > > 1. Did you do a proper test coverage analysis? > >> > > > > > > >> > > > > > > >> > > > > > Just as Xintong said, we already have a CI stage for fine > >> grained > >> > > > > resource > >> > > > > > managers. > >> > > > > > And I will make sure FineGrainedSlotManager as the default > >> > > SlotManager > >> > > > > can > >> > > > > > pass all the tests of CI. > >> > > > > > In addition, I will review all unit tests of > >> > > > DeclarativeSlotManager(DSM) > >> > > > > to > >> > > > > > ensure that there are no gaps in the > >> > > > > > coverage provided by the FineGrainedSlotManager. > >> > > > > > I also added the 'Test Plan' part to the FLIP. > >> > > > > > @Matthias @John @Shammon Does this test plan address your > >> concerns? > >> > > > > > > >> > > > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite > >> > big > >> > > in > >> > > > > > > >> > > > > > terms of lines of code.... > >> > > > > > > >> > > > > > > >> > > > > > IMO, the refactoring of SlotManager does not belong to this FLIP > >> > > since > >> > > > it > >> > > > > > may lead to some unstable risks. For > >> > > > > > FineGrainedSlotManager(FGSM), we already split some reasonable > >> > > > > components. > >> > > > > > They are: > >> > > > > > * TaskManagerTracker: Track task managers and their resources. > >> > > > > > * ResourceTracker: track requirements of jobs > >> > > > > > * ResourceAllocationStrategy: Try to fulfill the resource > >> > > requirements > >> > > > > with > >> > > > > > available/pending resources. > >> > > > > > * SlotStatusSyncer: communicate with TaskManager, for > >> > > > allocating/freeing > >> > > > > > slot and reconciling the slot status > >> > > > > > Maybe we can start a discussion about refactoring SlotManager in > >> > > > another > >> > > > > > FLIP if there are some good suggestions. > >> > > > > > WDYT > >> > > > > > > >> > > > > > 3. For me personally, having a more detailed summary comparing > >> the > >> > > > > >> subcomponents of both SlotManager implementations with where > >> > > > > >> their functionality matches and where they differ might help > >> > > > understand > >> > > > > the > >> > > > > >> consequences of the changes proposed in FLIP-298 > >> > > > > > > >> > > > > > Good suggestion, I have updated the comparison in this FLIP. > >> > Looking > >> > > > > > forward to any suggestions/thoughts > >> > > > > > if they are not described clearly. > >> > > > > > > >> > > > > > *@John > >> > > > > > > >> > > > > > 4. In addition to changing the default, would it make sense to > >> log > >> > a > >> > > > > >> deprecation warning on initialization > >> > > > > > > >> > > > > > if the DeclarativeSlotManager is used? > >> > > > > >> > >> > > > > > SGTM, We should add Deprecated annotations to DSM for devs. And > >> > log a > >> > > > > > deprecation warning for users. > >> > > > > > > >> > > > > > *@Shammon > >> > > > > > > >> > > > > > 1. For their functional differences, can you give some detailed > >> > tests > >> > > > to > >> > > > > >> verify that the new FineGrainedSlotManager has these > >> capabilities? > >> > > > This > >> > > > > can > >> > > > > >> effectively verify the new functions > >> > > > > >> > >> > > > > > As just maintained, there is already a CI stage of FGSM, and I > >> will > >> > > do > >> > > > > more > >> > > > > > review of unit tests for DSM. > >> > > > > > > >> > > > > > 2. I'm worried that many functions are not independent and it > >> is > >> > > > > difficult > >> > > > > >> to migrate step-by-step. You can list the relationship between > >> > them > >> > > in > >> > > > > >> detail. > >> > > > > > > >> > > > > > As Xintong saied the DSM is a subset of FGSM by design. But as > >> > time > >> > > > goes > >> > > > > > on, FGSM has some lacking > >> > > > > > functions as I listed in this FLIP. And I have added the > >> comparison > >> > > > > between > >> > > > > > DSM and FGSM in this FLIP. > >> > > > > > > >> > > > > > > >> > > > > > Thanks again for all your thoughts. Any feedback is appreciated! > >> > > > > > > >> > > > > > Best, > >> > > > > > Weihua > >> > > > > > > >> > > > > > > >> > > > > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song < > >> tonysong...@gmail.com > >> > > > >> > > > > wrote: > >> > > > > > > >> > > > > >> Thanks Weihua for preparing this FLIP. +1 for the proposal. > >> > > > > >> > >> > > > > >> > >> > > > > >> As one of the contributors of the fine-grained slot manager, > >> I'd > >> > > like > >> > > > to > >> > > > > >> share some backgrounds here. > >> > > > > >> > >> > > > > >> - There used to be a defaut slot manager implementation, which > >> is > >> > > > > >> non-declarative and has been removed now. The two features, > >> > > > declarative > >> > > > > / > >> > > > > >> reactive resource management and fine-grained resource > >> management, > >> > > > were > >> > > > > >> proposed at about the same time. We were aware that by design > >> the > >> > > > > >> declarative slot manager is a subset of fine-grained slot > >> manager > >> > at > >> > > > > that > >> > > > > >> time, but still decided to implement two slot managers for the > >> > > purpose > >> > > > > of > >> > > > > >> decoupling efforts and reducing cross-team synchronization > >> > overhead. > >> > > > > >> Merging the two slot managers once they are proved stable is > >> IMO a > >> > > > > >> technical debt that was planned at the very beginning. > >> > > > > >> > >> > > > > >> - The FineGrainedSlotManager has been verified in Alibaba's > >> > internal > >> > > > > >> production as well as Alibaba Cloud services as the default > >> slot > >> > > > manager > >> > > > > >> for about 2 years. > >> > > > > >> > >> > > > > >> > >> > > > > >> Concerning test cases, we currently have a ci stage for fine > >> > grained > >> > > > > >> resource management. To avoid adding too much burden, the stage > >> > only > >> > > > > >> includes tests from flink-runtime and flink-test modules. I > >> think > >> > > > > switching > >> > > > > >> the default slot manager and applying the whole set of tests on > >> > the > >> > > > > >> fine-grained slot manager would help us to be more confident > >> about > >> > > it. > >> > > > > >> > >> > > > > >> > >> > > > > >> Concerning cutting out functionalities out of slot manager, I > >> > think > >> > > > > Yangze > >> > > > > >> and I have tried our best to shape the FineGrainedSlotManager > >> into > >> > > > > >> reasonable components. I personally don't have other ideas to > >> > > further > >> > > > > >> disassemble the component, but I'm open to such suggestions. > >> > > However, > >> > > > > from > >> > > > > >> the stability perspective, I'd be in favor of not introducing > >> > > > > significant > >> > > > > >> changes to the FineGrainedSlotManager while switching it to the > >> > > > default. > >> > > > > >> Because the current implementation has already been verified > >> (or > >> > at > >> > > > > least > >> > > > > >> partially verified because Alibaba does not cover all the Flink > >> > use > >> > > > > cases), > >> > > > > >> and introducing more changes also means more chances of > >> breaking > >> > > > things. > >> > > > > >> > >> > > > > >> > >> > > > > >> Best, > >> > > > > >> > >> > > > > >> Xintong > >> > > > > >> > >> > > > > >> > >> > > > > >> > >> > > > > >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zjur...@gmail.com> > >> > > wrote: > >> > > > > >> > >> > > > > >> > Hi > >> > > > > >> > > >> > > > > >> > Thanks for starting this work weihua, I think unifying > >> > > > > >> > DeclarativeSlotManager and FineGrainedSlotManager is > >> valuable. > >> > > > > >> > > >> > > > > >> > I agree with @Matthias and @John that we need a way to ensure > >> > that > >> > > > > >> > DeclarativeSlotManager's capabilities are fully covered by > >> > > > > >> > FineGrainedSlotManager > >> > > > > >> > > >> > > > > >> > 1. For their functional differences, can you give some > >> detailed > >> > > > tests > >> > > > > to > >> > > > > >> > verify that the new FineGrainedSlotManager has these > >> > capabilities? > >> > > > > This > >> > > > > >> can > >> > > > > >> > effectively verify the new functions > >> > > > > >> > > >> > > > > >> > 2. I'm worried that many functions are not independent and > >> it is > >> > > > > >> difficult > >> > > > > >> > to migrate step-by-step. You can list the relationship > >> between > >> > > them > >> > > > in > >> > > > > >> > detail. > >> > > > > >> > > >> > > > > >> > 3. As John mentioned, give a smoke test for > >> > FineGrainedSlotManager > >> > > > is > >> > > > > a > >> > > > > >> > good idea. Or you can add some test information to the > >> > > > > >> > DeclarativeSlotManager to determine how many tests have used > >> it. > >> > > In > >> > > > > this > >> > > > > >> > way, we can gradually construct test cases for > >> > > > FineGrainedSlotManager > >> > > > > >> > during the development process. > >> > > > > >> > > >> > > > > >> > > >> > > > > >> > Best, > >> > > > > >> > Shammon > >> > > > > >> > > >> > > > > >> > > >> > > > > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler < > >> > > vvcep...@apache.org> > >> > > > > >> wrote: > >> > > > > >> > > >> > > > > >> > > Thanks for the FLIP, Weihua! > >> > > > > >> > > > >> > > > > >> > > I’ve read the FLIP, and it sounds good to me. We need to > >> avoid > >> > > > > >> > > proliferating alternative implementations wherever > >> possible. I > >> > > > have > >> > > > > >> just > >> > > > > >> > a > >> > > > > >> > > couple of comments: > >> > > > > >> > > > >> > > > > >> > > 1. I share Matthias’s concern about ensuring the behavior > >> is > >> > > > really > >> > > > > the > >> > > > > >> > > same. One suggestion I’ve used for this kind of thing is, > >> as a > >> > > > smoke > >> > > > > >> > test, > >> > > > > >> > > to update the DeclarativeSlotManager to just delegate to > >> the > >> > > > > >> > > FineGrainedSlotManager. If the full test suite still > >> passes, > >> > you > >> > > > > can be > >> > > > > >> > > pretty sure the new default is really ok. It would not be a > >> > good > >> > > > > idea > >> > > > > >> to > >> > > > > >> > > actually keep that in for the release, since it would > >> remove > >> > the > >> > > > > option > >> > > > > >> > to > >> > > > > >> > > fall back in case of bugs. Either way, we need to make sure > >> > all > >> > > > test > >> > > > > >> > > scenarios are present for the FGSM. > >> > > > > >> > > > >> > > > > >> > > 4. In addition to changing the default, would it make > >> sense to > >> > > > log a > >> > > > > >> > > deprecation warning on initialization if the > >> > > > DeclarativeSlotManager > >> > > > > is > >> > > > > >> > used? > >> > > > > >> > > > >> > > > > >> > > Thanks again, > >> > > > > >> > > John > >> > > > > >> > > > >> > > > > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote: > >> > > > > >> > > > Hi Weihua, > >> > > > > >> > > > Thanks for your proposal. From a conceptual point: AFAIU, > >> > the > >> > > > > >> > > > DeclarativeSlotManager covers a subset (i.e. only evenly > >> > sized > >> > > > > slots) > >> > > > > >> > of > >> > > > > >> > > > what the FineGrainedSlotManager should be able to achieve > >> > > > > (variable > >> > > > > >> > slot > >> > > > > >> > > > size per task manager). Is this the right > >> > > > > assumption/understanding? > >> > > > > >> In > >> > > > > >> > > this > >> > > > > >> > > > sense, merging both implementations into a single one > >> sounds > >> > > > > good. A > >> > > > > >> > few > >> > > > > >> > > > more general comments, though: > >> > > > > >> > > > > >> > > > > >> > > > 1. Did you do a proper test coverage analysis? That's not > >> > > > > mentioned > >> > > > > >> in > >> > > > > >> > > the > >> > > > > >> > > > current version of the FLIP. I'm bringing this up > >> because we > >> > > ran > >> > > > > into > >> > > > > >> > the > >> > > > > >> > > > same issue when fixing the flaws that popped up after > >> > > > introducing > >> > > > > the > >> > > > > >> > > > multi-component leader election (see FLIP-285 [1]). There > >> > is a > >> > > > > risk > >> > > > > >> > that > >> > > > > >> > > by > >> > > > > >> > > > removing the legacy code we decrease test coverage > >> because > >> > > > certain > >> > > > > >> > > > test cases that were covered for the legacy classes might > >> > not > >> > > be > >> > > > > >> > > > necessarily covered in the new implementation, yet (see > >> > > > > FLINK-30338 > >> > > > > >> [2] > >> > > > > >> > > > which covers this issue for the leader election case). > >> > > Ideally, > >> > > > we > >> > > > > >> > don't > >> > > > > >> > > > want to remove test cases accidentally because they were > >> > only > >> > > > > >> > implemented > >> > > > > >> > > > for the DeclarativeSlotManager but missed for the > >> > > > > >> > FineGrainedSlotManager. > >> > > > > >> > > > > >> > > > > >> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel > >> > > quite > >> > > > > big > >> > > > > >> in > >> > > > > >> > > > terms of lines of code. Without knowing whether it's > >> > actually > >> > > a > >> > > > > >> > > reasonable > >> > > > > >> > > > thing to do: Instead of just adding more features to the > >> > > > > >> > > > FineGrainedSlotManager, have you thought of cutting out > >> > > > > functionality > >> > > > > >> > > into > >> > > > > >> > > > smaller sub-components along this refactoring? Such a > >> > > > step-by-step > >> > > > > >> > > approach > >> > > > > >> > > > might improve the overall codebase and might make > >> reviewing > >> > > the > >> > > > > >> > > refactoring > >> > > > > >> > > > easier. I did a first pass over the code and struggled to > >> > > > identify > >> > > > > >> code > >> > > > > >> > > > blocks that could be moved out of the SlotManager > >> > > > > implementation(s). > >> > > > > >> > > > Therefore, I might be wrong with this proposal. I haven't > >> > > worked > >> > > > > on > >> > > > > >> > this > >> > > > > >> > > > codebase in that detail that it would allow me to come up > >> > > with a > >> > > > > >> > > judgement > >> > > > > >> > > > call. I wanted to bring it up, anyway, because I'm > >> curious > >> > > > whether > >> > > > > >> that > >> > > > > >> > > > could be an option. There's a comment created by Chesnay > >> > > (CC'd) > >> > > > in > >> > > > > >> the > >> > > > > >> > > > JavaDoc of TaskExecutorManager [3] indicating something > >> > > similar. > >> > > > > I'm > >> > > > > >> > > > wondering whether he can add some insights here. > >> > > > > >> > > > > >> > > > > >> > > > 3. For me personally, having a more detailed summary > >> > comparing > >> > > > the > >> > > > > >> > > > subcomponents of both SlotManager implementations with > >> where > >> > > > > >> > > > their functionality matches and where they differ might > >> help > >> > > > > >> understand > >> > > > > >> > > the > >> > > > > >> > > > consequences of the changes proposed in FLIP-298. > >> > > > > >> > > > > >> > > > > >> > > > Best, > >> > > > > >> > > > Matthias > >> > > > > >> > > > > >> > > > > >> > > > [1] > >> > > > > >> > > > > >> > > > > >> > > > >> > > > > >> > > >> > > > > >> > >> > > > > > >> > > > > >> > > > >> > > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box > >> > > > > >> > > > [2] https://issues.apache.org/jira/browse/FLINK-30338 > >> > > > > >> > > > [3] > >> > > > > >> > > > > >> > > > > >> > > > >> > > > > >> > > >> > > > > >> > >> > > > > > >> > > > > >> > > > >> > > >> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68 > >> > > > > >> > > > > >> > > > > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang < > >> wang...@163.com> > >> > > > > wrote: > >> > > > > >> > > > > >> > > > > >> > > >> This is a good proposal for me, it will make the code of > >> > the > >> > > > > >> > SlotManager > >> > > > > >> > > >> more clear. > >> > > > > >> > > >> > >> > > > > >> > > >> > >> > > > > >> > > >> > >> > > > > >> > > >> -- > >> > > > > >> > > >> > >> > > > > >> > > >> Best, > >> > > > > >> > > >> Matt Wang > >> > > > > >> > > >> > >> > > > > >> > > >> > >> > > > > >> > > >> ---- Replied Message ---- > >> > > > > >> > > >> | From | David Morávek<d...@apache.org> | > >> > > > > >> > > >> | Date | 02/27/2023 22:45 | > >> > > > > >> > > >> | To | <dev@flink.apache.org> | > >> > > > > >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the > >> > > Implementation > >> > > > > of > >> > > > > >> > > >> SlotManager | > >> > > > > >> > > >> Hi Weihua, I still need to dig into the details, but the > >> > > > overall > >> > > > > >> > > sentiment > >> > > > > >> > > >> of this change sounds reasonable. > >> > > > > >> > > >> > >> > > > > >> > > >> Best, > >> > > > > >> > > >> D. > >> > > > > >> > > >> > >> > > > > >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen < > >> > > > > >> > > zhanghao.c...@outlook.com> > >> > > > > >> > > >> wrote: > >> > > > > >> > > >> > >> > > > > >> > > >> Thanks for driving this topic. I think this FLIP could > >> help > >> > > > > clean up > >> > > > > >> > the > >> > > > > >> > > >> codebase to make it easier to maintain. +1 on it. > >> > > > > >> > > >> > >> > > > > >> > > >> Best, > >> > > > > >> > > >> Zhanghao Chen > >> > > > > >> > > >> ________________________________ > >> > > > > >> > > >> From: Weihua Hu <huweihua....@gmail.com> > >> > > > > >> > > >> Sent: Monday, February 27, 2023 20:40 > >> > > > > >> > > >> To: dev <dev@flink.apache.org> > >> > > > > >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the > >> Implementation of > >> > > > > >> > SlotManager > >> > > > > >> > > >> > >> > > > > >> > > >> Hi everyone, > >> > > > > >> > > >> > >> > > > > >> > > >> I would like to begin a discussion on FLIP-298: Unifying > >> > the > >> > > > > >> > > Implementation > >> > > > > >> > > >> of SlotManager[1]. There are currently two types of > >> > > SlotManager > >> > > > > in > >> > > > > >> > > Flink: > >> > > > > >> > > >> DeclarativeSlotManager and FineGrainedSlotManager. > >> > > > > >> > > FineGrainedSlotManager > >> > > > > >> > > >> should behave as DeclarativeSlotManager if the user does > >> > not > >> > > > > >> configure > >> > > > > >> > > the > >> > > > > >> > > >> slot request profile. > >> > > > > >> > > >> > >> > > > > >> > > >> Therefore, this FLIP aims to unify the implementation of > >> > > > > SlotManager > >> > > > > >> > in > >> > > > > >> > > >> order to reduce maintenance costs. > >> > > > > >> > > >> > >> > > > > >> > > >> Looking forward to hearing from you. > >> > > > > >> > > >> > >> > > > > >> > > >> [1] > >> > > > > >> > > >> > >> > > > > >> > > >> > >> > > > > >> > > >> > >> > > > > >> > > > >> > > > > >> > > >> > > > > >> > >> > > > > > >> > > > > >> > > > >> > > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager > >> > > > > >> > > >> > >> > > > > >> > > >> Best, > >> > > > > >> > > >> Weihua > >> > > > > >> > > >> > >> > > > > >> > > >> > >> > > > > >> > > > >> > > > > >> > > >> > > > > >> > >> > > > > > >> > > > > >> > > > >> > > >> > >