Hi all,

We have seen a number of issues in core scheduling around keeping data in
sync between the cache and the scheduling objects.
The current design has a cache and a scheduling layer. The cache stores
most of the data related to applications, queues and nodes. The scheduler
mirrors that structure and stores scheduling-related data that is not
tracked in the cache, like a schedulingAsk. The main objects have a
one-to-one relation between cache and scheduler: each object in the
scheduler has a corresponding object in the cache. These objects are
loosely coupled using events.

This is where the issues pop up. Event processing is not guaranteed. When
the scheduler makes a decision, like allocating a request, an event is sent
to the cache to update it. While that event is waiting to be processed by
the cache, the scheduler might make more decisions. These decisions could
then be based on out-of-date information.
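
As a rough illustration of the problem, here is a minimal Go sketch. It is
not the YuniKorn code: nodeCache, allocEvent, schedule and runStaleExample
are hypothetical names, and resources are reduced to a single integer. The
scheduler checks the free resources in the cache, then queues an event
instead of updating the cache directly. Two decisions made before the
events are processed both see the same free amount.

```go
package main

import "fmt"

// allocEvent is a hypothetical event sent to the cache after the
// scheduler decides to allocate a request.
type allocEvent struct {
	nodeID string
	size   int
}

// nodeCache is a hypothetical stand-in for the cache: it tracks the free
// resources per node and only changes when queued events are applied.
type nodeCache struct {
	free   map[string]int
	events chan allocEvent
}

// drain applies every queued event. Until it runs, the cache is stale.
func (c *nodeCache) drain() {
	for {
		select {
		case ev := <-c.events:
			c.free[ev.nodeID] -= ev.size
		default:
			return
		}
	}
}

// schedule decides using the current cache view and queues an event
// rather than updating the cache itself.
func schedule(c *nodeCache, nodeID string, size int) bool {
	if c.free[nodeID] < size {
		return false
	}
	c.events <- allocEvent{nodeID: nodeID, size: size}
	return true
}

// runStaleExample makes two decisions before any event is processed.
func runStaleExample() (bool, bool, int) {
	c := &nodeCache{
		free:   map[string]int{"node-1": 10},
		events: make(chan allocEvent, 8),
	}
	first := schedule(c, "node-1", 8)
	second := schedule(c, "node-1", 8) // still sees 10 free
	c.drain()
	return first, second, c.free["node-1"]
}

func main() {
	first, second, free := runStaleExample()
	fmt.Println(first, second, free) // prints "true true -6"
}
```

Both requests are accepted and the node ends up over-allocated, because the
second decision was made before the first event reached the cache.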

We have no way to guarantee that the events are processed, or when. For
this reason we are proposing, and have started work on, removing the cache
from the scheduler. The functionality and data currently stored in the
cache will be moved into the scheduler.
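
To sketch what this could look like, assuming a single object that holds
both the data and the scheduling logic: schedulingNode, tryAllocate and
runDirectExample below are hypothetical names used for illustration only,
and the actual design is in the document attached to the jira. The check
and the update happen under one lock, so a later decision always sees the
result of an earlier one.

```go
package main

import (
	"fmt"
	"sync"
)

// schedulingNode is a hypothetical merged object: it holds the node data
// that would otherwise live in a separate cache.
type schedulingNode struct {
	sync.Mutex
	free int
}

// tryAllocate checks and updates the free resources in one locked step,
// so no event is needed to propagate the change.
func (n *schedulingNode) tryAllocate(size int) bool {
	n.Lock()
	defer n.Unlock()
	if n.free < size {
		return false
	}
	n.free -= size
	return true
}

// runDirectExample makes two back-to-back decisions on the same node.
func runDirectExample() (bool, bool, int) {
	n := &schedulingNode{free: 10}
	first := n.tryAllocate(8)
	second := n.tryAllocate(8) // sees 2 free and is rejected
	return first, second, n.free
}

func main() {
	first, second, free := runDirectExample()
	fmt.Println(first, second, free) // prints "true false 2"
}
```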

YUNIKORN-317 [1] has been logged for this work, with a design document
attached to the jira. Based on our first analysis and the first part of the
work we did, we want to propose working on this in a development branch. We
hope to finish this work before the 0.10 release that we have targeted. We
will regularly merge the main branch into the development branch, and merge
back into the main branch when:
- we pass the same unit and e2e tests as the current main branch
- the changes pass an accept vote on this list.

I will create the branch to start the work.

Thank you,
Wilfred

[1] https://issues.apache.org/jira/browse/YUNIKORN-317
