Hi!

In my opinion this is the right way as well. The App CRD is already following this pattern: its controller handles the CRD related events, and every pod related event is handled by the general operator. Regarding the status of an application: right now, the source of truth is the core side, since the core decides when the app is accepted, started, running, etc.
Regards,
Kinga

On Tue, Mar 30, 2021 at 6:20 AM Chaoran Yu <[email protected]> wrote:

> Thanks Wilfred for the proposal.
>
> I agree with the overall approach. To summarize the desired responsibilities of a generic 3rd party app management plugin (only the Spark operator plugin for now), combining with what Weiwei said:
>
> * It will only react to lifecycle events (Add, Update, Delete etc.) for its CRD objects, and it will not react to lifecycle events for pods, which will be left for the general plugin to take care of.
> * For the same underlying workload (e.g. a Spark job), the 3rd party plugin should see the same app ID as the general plugin (i.e. no two plugins should treat the same workload as two different apps).
> * The state of an application will be determined by the 3rd party plugin (e.g. the Spark operator plugin, not the general plugin, will determine the current state of a Spark job).
>
> Chaoran
>
> On Mon, Mar 29, 2021 at 1:55 PM Weiwei Yang <[email protected]> wrote:
>
> > Hi Wilfred
> >
> > The original idea was to have each app mgmt plugin, e.g. the Spark operator plugin, manage a certain type of app's lifecycle independently. That means each pod on K8s will only be seen and monitored by one app mgmt plugin. The problems we found earlier occurred because this idea was broken: both the general and Spark plugins react to the same set of Spark pods. This is a bit different from your proposal; could you please take a look?
> >
> > On Mon, Mar 29, 2021 at 3:34 AM Wilfred Spiegelenburg <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Based on testing performed around gang scheduling and the Spark operator by Bowen Li and Chaoran Yu, we found that the behaviour around the operator was far from optimal. YUNIKORN-558 <https://issues.apache.org/jira/browse/YUNIKORN-558> was logged to help with the integration.
> > > We did not put any development or test time into making sure the operator and gang scheduling worked together. The behaviour that was observed was not linked to gang scheduling but to the generic way the operator implementation works in YuniKorn.
> > >
> > > The current Spark operator integration, implemented in pkg/appmgmt/sparkoperator/spark.go, listens to the Spark CRD add/update/delete events. Each CRD is then converted into an application inside YuniKorn and processed. The pods created by the Spark operator form the other half of the application. However, the CRD has its own application ID, and the application ID for the Spark pods (drivers and executors) is different.
> > >
> > > This leaves us with two applications in the system: one without pods (CRD based) and one with pods (the real workload). The real workload pods have an owner reference set to the CRD. Having two applications for one real workload is strange: it does not work correctly in the UI and causes all kinds of issues on completion and on recovery after a restart.
> > >
> > > The proposal is now to merge the two objects into one application inside YuniKorn. The CRD can still be used to track updates and provide events for scheduling etc. The "ApplicationID" set in the driver and executor pods should be used to track this application. The owner reference allows linking the real pods back to the CRD. The CRD will be used to provide the life cycle tracking and to act as an event collector.
> > >
> > > All these changes do require rework on the app management side. I hope the proposal sounds like the correct way forward. This same CRD based mechanism also seems to fit the way the Flink operator works.
> > >
> > > Please provide some feedback on this proposal. Implementation would require changes in app management and the related unit tests.
> > > Recovery and gang scheduling tests should also be covered under this change.
> > >
> > > Wilfred
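[Editor's note] The proposal above boils down to resolving a single application ID for both halves of the workload: prefer the ID set on the driver/executor pods, and fall back to the owning SparkApplication CRD. A rough Go sketch of that resolution follows; the struct types are simplified stand-ins (the real code would use k8s.io/api/core/v1 types), and the label key and fallback order are assumptions for illustration, not YuniKorn's actual implementation.

```go
package main

import "fmt"

// OwnerRef and Pod are minimal stand-ins for the Kubernetes objects
// involved; names and fields here are illustrative only.
type OwnerRef struct {
	Kind string
	Name string
}

type Pod struct {
	Labels    map[string]string
	OwnerRefs []OwnerRef
}

// appIDLabel is the label both plugins would have to agree on so the
// CRD and its pods map to one application (assumed key).
const appIDLabel = "spark-app-selector"

// applicationID resolves one app ID for a Spark pod: prefer the label
// set on the driver/executor pods, then fall back to the owning
// SparkApplication CRD so both halves share a single application.
func applicationID(p Pod) (string, bool) {
	if id, ok := p.Labels[appIDLabel]; ok && id != "" {
		return id, true
	}
	for _, ref := range p.OwnerRefs {
		if ref.Kind == "SparkApplication" {
			return ref.Name, true
		}
	}
	return "", false
}

func main() {
	driver := Pod{
		Labels:    map[string]string{appIDLabel: "spark-pi-0001"},
		OwnerRefs: []OwnerRef{{Kind: "SparkApplication", Name: "spark-pi"}},
	}
	id, _ := applicationID(driver)
	fmt.Println(id) // spark-pi-0001
}
```

With a resolver like this, the CRD event handler and the pod event handler would both arrive at the same application inside YuniKorn, so the CRD can drive life cycle tracking while the pods carry the real workload.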
