Hi folks,

Recently I've being discussing the problems of the current design of the
experimental
`RECONCILE_OPERATIONS` scheduler API with a couple people. The discussion
was started
from MESOS-9318 <https://issues.apache.org/jira/browse/MESOS-9318>: when a
framework receives an `OPERATION_UNKNOWN`, it doesn't know
if it should retry the operation or not (further details described below).
As the discussion
evolves, we realize there are more issues to consider, design-wise and
implementation-wise, so
I'd like to reach out to the community to get valuable opinions from you
guys.

Before I jump right into the issues I'd like to discuss, let me fill you
guys in with some
background of operation reconciliation. Since the design of this feature
was informed by the
pre-existing implementation of task reconciliation, I'll begin there.

*Task Reconciliation: Design*

The scheduler API has a `RECONCILE` call for a framework to query the
current statuses
of its tasks. This call supports the following modes:

   - *Explicit reconciliation*: The framework specifies the list of tasks
   it wants to know
   about, and expects status updates for these tasks.

   - *Implicit reconciliation*: The framework does not specify a list of
   tasks, and simply
   expects status updates for all tasks the master knows about.

In both cases, the master looks into its in-memory task bookkeeping and
sends
*one or more`UPDATE` events* to respond to the reconciliation request.

*Task Reconciliation: Problems*

This API design of task reconciliation has the following shortcomings:

   - (1) There is no clear boundary of when the "reconciliation response"
   ends, and thus
   there is
*no 1-1 correspondence between the reconciliation request and the response*.
   For explicit reconciliation, the framework might wait for an extended period
   of time before it receives all status updates; for implicit
   reconciliation, there is no way for
   a framework to tell if it has learned about all of its tasks, which
   could be inconvenient if
   the framework has lost its task bookkeeping.

   - (2) The "reconciliation response" may be outdated. If an agent
   reregisters after a task
   reconciliation has been responded,
*the framework wouldn't learn about the tasks **from this recovered agent*.
   Mesos relies on the framework to call the `RECONCILE` call
   *periodically* to get up-to-date task statuses.



*Operation Reconciliation: Design & Problems*

When designing operation reconciliation, we made the `RECONCILE_OPERATIONS`
call
*asynchronous request-response style call* that returns a 200 OK with a
list of operation status
to avoid (1). However, this design does not resolve (2), and also
introduces new problems:

   - (3) *The synchronous response could race with the event stream* and
   the framework
   does not know which contains the latest operation status.

   - (4) To ensure scalability, the master does not manage local resource
   providers (LRPs);
   the agents do. So the master cannot tell if an LRP is temporarily
   unreachable/recovering
   or permanently gone. As a result, if the framework explicitly reconciles
   an LRP operation
   that the master does not know about, it can only reply
   `OPERATION_UNKNOWN`, but
   then *the framework would not know if the operation would come back in
   the future*,
   and thus cannot decide if it should reissue another operation, which
   leads to MESOS-9318 <https://issues.apache.org/jira/browse/MESOS-9318>.

   Note that this is less of a problem for explicit task reconciliation,
   because in most cases
   the master can infer task statuses from agent statuses, and in the rare
   cases that it
   replies `TASK_UNKNOWN`, it is generally safe for the framework to
   relaunch another
   task.


*The Open Question*

Now, the big question here is:
*are the benefits of a synchronous request-responsestyle
`RECONCILE_OPERATIONS` call worth the complexity it introduces* in order to
address (3) and (4) in the code? To explain what the complexity would be,
let me lay out a
couple proposals we've been discussing:

I. Keep `RECONCILE_OPERATIONS` synchronous

To address (3), we could add a *timestamp* to every operation status as
well as the
reconciliation response, so the framework can infer which one is the latest
status, and if
it receives a stale operation status update after the reconciliation
response, it can just
ack the status update without updating its bookkeeping. But, the framework
needs to
deal with a corner case:

*when it receives a reconciliation response containing aterminal operation
status, it may or may not receive one or more status updatesfor that
operation later *because of the race.


To address (4), we could either: (a) surface the unreachable/gone LRPs to
the master, or
(b) forward the explicit reconciliation request to the corresponding agent.
The complexity
of (a) is that
*it might not be scalable for the master to maintain the list ofunreachable
and gone LRPs*: imagine that there are 1k nodes and 10 active + 10 gone
LRPs per node, then the master need to maintain 20k entries for LRPs. The
complexity
of (b) is that the response wouldn't be computed based on the master's
state; instead,
*the master needs to wait for the agent's reply to respond to the framework*.
Note
that it's probably not scalable to forward implicit reconciliation requests
to all agents, so
implicit reconciliation might have to still be responded based on the
master's state.


II. Make `RECONCILE_OPERATIONS` "semi-synchronous"

Instead of returning a 200 OK, the master could return a 202 Accepted with
an empty
body, and then
*reply a single event containing the operation status of all
requestedoperations in the event stream asynchronously*. Although the
framework loses the
1-1 correspondence between the request and the response, there's still a
clear boundary
for a reconciliation response. The advantage of this approach compared to
proposal I is
that we don't have a race between the reconciliation response and the event
stream, so
no timestamp is required. Still, we have to address (4) through either (a)
or (b) described
above, thus the complexity remains. That said, this approach fits with (b)
better since no
synchronous response is needed.


III. Make `RECONCILE_OPERATIONS` an asynchronous trigger

This would be similar to what we have for task reconciliation. The master
would return a
202 Accepted, and then send
*one or more `UPDATE_OPERATION_STATUS` events*based on its state for an
implicit reconciliation, or
*forward the request to some agent*for an explicit reconciliation. In other
words, this call plays the role of a trigger of the
operation status updates. This approach is the simplest in terms of the
implementation,
but the trade-off is that the framework needs to live with (1).


So far we haven't discussed much about (2) for operation reconciliation, so
let's also briefly talk
about it. Potentially (2) can be addressed by making the agent *actively
push *
*operation statusupdates to the framework when an LRP is resubscribed*, so
the framework won't need to do
periodic operation reconciliation. If we do this in the future, it would
also be more aligned with
proposal II or III.

So the question again: is it worth the complexity to keep
`RECONCILE_OPERATIONS`
synchronous? I'd like to hear the opinions from the community so we can
drive towards a better
API design!

Best,
Chun-Hung

Reply via email to