Hi!

In my opinion this is the right way as well. The App CRD is already
following this pattern: its controller handles the CRD-related events,
and every Pod-related event is handled by the general operator.
Regarding the status of an application: right now the source of truth is
the core side, since the core decides when the app is accepted, started,
running, etc.
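
A minimal sketch of that split; the names (dispatch, crdController,
generalOperator) are illustrative, not the real YuniKorn identifiers:

package main

import "fmt"

type Event struct {
    Kind string // "SparkApplication" (the CRD) or "Pod"
    Name string
}

// crdController handles only the CRD-related events.
func crdController(e Event) { fmt.Println("CRD controller:", e.Name) }

// generalOperator handles every Pod-related event.
func generalOperator(e Event) { fmt.Println("general operator:", e.Name) }

// dispatch routes each event to exactly one handler by kind.
func dispatch(e Event) {
    switch e.Kind {
    case "SparkApplication":
        crdController(e)
    case "Pod":
        generalOperator(e)
    }
}

func main() {
    dispatch(Event{Kind: "SparkApplication", Name: "spark-pi"})
    dispatch(Event{Kind: "Pod", Name: "spark-pi-driver"})
}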

Regards,
Kinga

On Tue, Mar 30, 2021 at 6:20 AM Chaoran Yu <[email protected]> wrote:

> Thanks, Wilfred, for the proposal.
>
> I agree with the overall approach. To summarize the desired
> responsibilities of a generic 3rd party app management plugin (only the
> Spark operator plugin for now), combined with what Weiwei said:
>
> * It will only react to lifecycle events (Add, Update, Delete, etc.) for
> its CRD objects, and it will not react to lifecycle events for pods, which
> will be left for the general plugin to take care of.
> * For the same underlying workload (e.g. a Spark job), the 3rd party
> plugin should see the same app ID as the general plugin, i.e. no two
> plugins should think of the same workload as two different apps (see the
> sketch below).
> * The state of an application will be determined by the 3rd party plugin
> (e.g. the Spark operator plugin, not the general plugin, will determine
> the current state of a Spark job).
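>
> A minimal sketch of that shared-app-ID rule; the label key and helper
> names are illustrative, not the real YuniKorn ones:
>
> package main
>
> import "fmt"
>
> // appIDFromCRD is the ID the 3rd party plugin derives from its CRD.
> func appIDFromCRD(namespace, crdName string) string {
>     return namespace + "-" + crdName
> }
>
> // appIDFromPod is the ID the general plugin reads off a driver or
> // executor pod, via a label the operator sets (illustrative key).
> func appIDFromPod(labels map[string]string) string {
>     return labels["example.com/app-id"]
> }
>
> func main() {
>     crdID := appIDFromCRD("default", "spark-pi")
>     podID := appIDFromPod(map[string]string{"example.com/app-id": "default-spark-pi"})
>     // One workload, one ID: both plugins must agree.
>     fmt.Println(crdID == podID) // true
> }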
>
> Chaoran
>
>
>
> On Mon, Mar 29, 2021 at 1:55 PM Weiwei Yang <[email protected]> wrote:
>
> > Hi Wilfred
> >
> > The original idea was to have each app mgmt plugin, e.g. the Spark
> > operator plugin, manage a certain type of app's lifecycle independently.
> > That means each pod on K8s will only be seen and monitored by one app
> > mgmt plugin. The problems we found earlier arose because this idea was
> > violated: both the general and the Spark plugin react to the same set of
> > Spark pods. This is a bit different from your proposal, could you please
> > take a look?
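> >
> > A rough sketch of that one-plugin-per-pod idea; the recognizer shape
> > here is illustrative, not the real YuniKorn API:
> >
> > package main
> >
> > import "fmt"
> >
> > // Pod is a trimmed stand-in for a K8s pod.
> > type Pod struct {
> >     Name   string
> >     Labels map[string]string
> > }
> >
> > // plugin owns the pods its recognize function matches.
> > type plugin struct {
> >     name      string
> >     recognize func(Pod) bool
> > }
> >
> > // dispatch hands a pod to the first matching plugin only, so every
> > // pod is seen and monitored by exactly one app mgmt plugin.
> > func dispatch(pod Pod, plugins []plugin) {
> >     for _, p := range plugins {
> >         if p.recognize(pod) {
> >             fmt.Printf("%s handles %s\n", p.name, pod.Name)
> >             return
> >         }
> >     }
> > }
> >
> > func main() {
> >     spark := plugin{"spark-operator", func(p Pod) bool {
> >         return p.Labels["operator"] == "spark" // illustrative label
> >     }}
> >     general := plugin{"general", func(Pod) bool { return true }}
> >     pod := Pod{Name: "spark-pi-driver", Labels: map[string]string{"operator": "spark"}}
> >     dispatch(pod, []plugin{spark, general})
> > }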
> >
> >
> > On Mon, Mar 29, 2021 at 3:34 AM Wilfred Spiegelenburg
> > <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Based on testing performed by Bowen Li and Chaoran Yu around gang
> > > scheduling and the Spark operator, we found that the behaviour around
> > > the operator was far from optimal. YUNIKORN-558
> > > <https://issues.apache.org/jira/browse/YUNIKORN-558> was logged to
> > > help with the integration.
> > > We did not put any development or test time into making sure the
> > > operator and gang scheduling worked together. The behaviour that was
> > > observed was not linked to gang scheduling but to the generic way the
> > > operator implementation works in YuniKorn.
> > >
> > > The current Spark operator integration, implemented in
> > > pkg/appmgmt/sparkoperator/spark.go, listens to the Spark CRD
> > > add/update/delete events. Each CRD is then converted into an
> > > application inside YuniKorn and processed. The pods created by the
> > > Spark operator form the other half of the application. However, the
> > > CRD has its own application ID, and the application ID for the Spark
> > > pods (drivers and executors) is different.
> > >
> > > This leaves us with two applications in the system: one without pods
> > > (CRD-based) and one with pods (the real workload). The real workload
> > > pods have an owner reference set to the CRD. Having two applications
> > > for one real workload is strange: it does not work correctly in the UI
> > > and causes all kinds of issues on completion and on recovery after a
> > > restart.
> > >
> > > The proposal is now to merge the two objects into one application
> > > inside YuniKorn. The CRD can still be used to track updates and
> > > provide events for scheduling, etc. The "ApplicationID" set in the
> > > driver and executor pods should be used to track this application.
> > > The owner reference allows linking the real pods back to the CRD, as
> > > in the sketch below. The CRD will be used to provide the lifecycle
> > > tracking and to act as an event collector.
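> > >
> > > A minimal sketch of that owner-reference linkage; the label key and
> > > function names are illustrative, not the real YuniKorn identifiers:
> > >
> > > package main
> > >
> > > import (
> > >     "fmt"
> > >
> > >     corev1 "k8s.io/api/core/v1"
> > >     metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
> > > )
> > >
> > > // sparkCRDOwner walks a pod's owner references back to the CRD.
> > > func sparkCRDOwner(pod *corev1.Pod) (string, bool) {
> > >     for _, ref := range pod.OwnerReferences {
> > >         if ref.Kind == "SparkApplication" {
> > >             return ref.Name, true
> > >         }
> > >     }
> > >     return "", false
> > > }
> > >
> > > func main() {
> > >     pod := &corev1.Pod{
> > >         ObjectMeta: metav1.ObjectMeta{
> > >             Name:   "spark-pi-driver",
> > >             Labels: map[string]string{"applicationId": "spark-pi-0001"}, // illustrative key
> > >             OwnerReferences: []metav1.OwnerReference{
> > >                 {Kind: "SparkApplication", Name: "spark-pi"},
> > >             },
> > >         },
> > >     }
> > >     if crd, ok := sparkCRDOwner(pod); ok {
> > >         // One application: the pod's ID is the app ID; the CRD (crd)
> > >         // only drives lifecycle tracking and event collection.
> > >         fmt.Println(crd, "->", pod.Labels["applicationId"])
> > >     }
> > > }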
> > >
> > > All these changes do require rework on the app management side. I
> > > hope the proposal sounds like the correct way forward. This same
> > > CRD-based mechanism also seems to fit the way the Flink operator
> > > works.
> > > Please provide some feedback on this proposal. Implementation would
> > > require changes in app management and the related unit tests. Recovery
> > > and gang scheduling tests should also be covered by this change.
> > >
> > > Wilfred
> > >
> >
>
