Hi Wilfred,

The original idea was to have each app management plugin, e.g. the Spark operator plugin, manage the lifecycle of one type of app independently. That means each pod on K8s is seen and monitored by only one app management plugin. The problems we found earlier came up because the current behaviour goes against this idea: both the general plugin and the Spark plugin react to the same set of Spark pods. This is a bit different from your proposal, could you please take a look?

On Mon, Mar 29, 2021 at 3:34 AM Wilfred Spiegelenburg <[email protected]> wrote:

> Hi,
>
> Based on testing that was performed around gang scheduling and the spark
> operator by Bowen Li and Chaoran Yu we found that the behaviour around the
> operator was far from optimal. YUNIKORN-558
> <https://issues.apache.org/jira/browse/YUNIKORN-558> was logged to help
> with the integration.
> We did not put any development or test time into making sure the operator
> and gang scheduling worked. The behaviour that was observed was not linked
> to gang scheduling but to the generic way the operator implementation works
> in YuniKorn.
>
> The current Spark operator, implemented
> in pkg/appmgmt/sparkoperator/spark.go, listens to the Spark CRD
> add/update/delete. Each CRD is then converted into an application inside
> YuniKorn and processed. The pods created by the Spark operator form the
> other half of the application. However the CRD has its own application ID.
> The application ID for the Spark pods (drivers and executors) is different.
>
> This leaves us with two applications in the system: one without pods (CRD
> based) and one with pods (the real workload). The real workload pods have
> an owner reference set to the CRD. Having two applications for one real
> workload is strange. It does not work correctly in the UI and gives all
> kinds of issues on completion and recovery on restart.
>
> The proposal is now to merge the two objects into one application inside
> YuniKorn. The CRD can still be used to track updates and provide events for
> scheduling etc. The "ApplicationID" set in the driver or executor pods
> should be used to track this application.
> The owner reference allows linking the real pods back to the CRD. The CRD
> will be used to provide the life cycle tracking and as an event collector.
>
> All these changes do require rework on the app management side. I hope the
> proposal sounds like the correct way forward. This same CRD based mechanism
> also seems to fit in with the way the flink operator works.
> Please provide some feedback on this proposal. Implementation would require
> changes in app management and related unit tests. Recovery and gang
> scheduling tests should also be covered under this change.
>
> Wilfred
>
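
To make the merge proposed above concrete, here is a rough sketch of tracking one application per pod "ApplicationID" and using the owner reference to attach CRD events to that same application. It is only an illustration under stated assumptions: `Registry`, `Pod`, `OwnerRef`, `Application`, `AddPod` and `RecordCRDEvent` are hypothetical names, not YuniKorn or Kubernetes types, and the `SparkApplication` kind string is assumed to be the kind of the Spark operator CRD.

```go
package main

import "fmt"

// All types here are hypothetical stand-ins for illustration only;
// they are not the YuniKorn or Kubernetes API types.

// OwnerRef mimics the part of a pod owner reference we care about.
type OwnerRef struct {
	Kind string
	Name string
}

// Pod carries the application ID set on the driver or executor pod.
type Pod struct {
	Name          string
	ApplicationID string
	Owners        []OwnerRef
}

// Application is the single merged object: pods plus CRD events.
type Application struct {
	ID      string
	CRDName string
	Pods    []string
	Events  []string
}

// Registry tracks applications by pod application ID and remembers
// which CRD each application was linked to via an owner reference.
type Registry struct {
	apps  map[string]*Application
	byCRD map[string]string // CRD name -> application ID
}

func NewRegistry() *Registry {
	return &Registry{apps: map[string]*Application{}, byCRD: map[string]string{}}
}

// AddPod files the pod under its application ID and, if the pod is owned
// by a CRD of the assumed kind, links that CRD to the same application.
func (r *Registry) AddPod(p Pod) *Application {
	app, ok := r.apps[p.ApplicationID]
	if !ok {
		app = &Application{ID: p.ApplicationID}
		r.apps[p.ApplicationID] = app
	}
	app.Pods = append(app.Pods, p.Name)
	for _, o := range p.Owners {
		if o.Kind == "SparkApplication" { // assumed CRD kind
			app.CRDName = o.Name
			r.byCRD[o.Name] = app.ID
		}
	}
	return app
}

// RecordCRDEvent attaches a CRD event to the application the CRD is linked
// to. It returns false if no pod has linked that CRD yet, instead of
// creating a second, pod-less application.
func (r *Registry) RecordCRDEvent(crdName, event string) bool {
	id, ok := r.byCRD[crdName]
	if !ok {
		return false
	}
	r.apps[id].Events = append(r.apps[id].Events, event)
	return true
}

func main() {
	r := NewRegistry()
	owner := []OwnerRef{{Kind: "SparkApplication", Name: "spark-pi"}}
	r.AddPod(Pod{Name: "spark-pi-driver", ApplicationID: "spark-0001", Owners: owner})
	r.AddPod(Pod{Name: "spark-pi-exec-1", ApplicationID: "spark-0001", Owners: owner})
	r.RecordCRDEvent("spark-pi", "RUNNING")

	app := r.apps["spark-0001"]
	fmt.Printf("%d application(s), %d pods, %d events\n",
		len(r.apps), len(app.Pods), len(app.Events))
}
```

In this sketch a CRD event that arrives before any pod is seen is simply rejected; a real implementation would have to decide whether to buffer such events, which the proposal above does not specify.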
