Hi,

Based on testing of gang scheduling with the Spark operator, performed by Bowen Li and Chaoran Yu, we found that the behaviour of the operator integration was far from optimal. YUNIKORN-558 <https://issues.apache.org/jira/browse/YUNIKORN-558> was logged to help with the integration. We did not put any development or test time into making sure the operator and gang scheduling worked together. The behaviour that was observed was not linked to gang scheduling but to the generic way the operator integration is implemented in YuniKorn.

The current Spark operator integration, implemented in pkg/appmgmt/sparkoperator/spark.go, listens for add/update/delete events on the Spark CRD. Each CRD is then converted into an application inside YuniKorn and processed. The pods created by the Spark operator form the other half of the application. However, the CRD has its own application ID, and the application ID for the Spark pods (drivers and executors) is different. This leaves us with two applications in the system: one without pods (CRD based) and one with pods (the real workload). The real workload pods have an owner reference set to the CRD.

Having two applications for one real workload is strange. It does not work correctly in the UI and causes all kinds of issues on completion and on recovery after a restart.

The proposal is now to merge the two objects into one application inside YuniKorn. The "ApplicationID" set in the driver or executor pods should be used to track this application. The owner reference allows linking the real pods back to the CRD. The CRD can still be used to track updates and provide events for scheduling etc.: it will provide the life cycle tracking and act as an event collector.

All these changes do require rework on the app management side. I hope the proposal sounds like the correct way forward. This same CRD based mechanism also seems to fit in with the way the Flink operator works. Please provide some feedback on this proposal.

Implementation would require changes in app management and the related unit tests. Recovery and gang scheduling tests should also be covered under this change.

Wilfred
