Hi all,

As some might have already known, we are currently working on patches to
implement the new GROW_VOLUME and SHRINK_VOLUME operations [1].

One problem surfaces is that, since the new operations are not supported in
Mesos 1.5, they will lead to an agent crash during the operation application
cycle if a Mesos 1.6 master send these operations to a Mesos 1.5 agent [2].

We are now consider two possibilities to address this compatibility problem:

1) The Mesos 1.6 master should check the agent's Mesos version in
`Master::accept` [3]. Moving forward, if we add new operations in future
Mesos
releases, we would have code like the following:

```
Version slaveVersion = ...; // Get the Mesos version of the slave of the
offer.
switch (operation.type()) {
  ...
  case SOME_NEW_OPERATION: {
    if (slaveVersion < minVersionForSomeNewOperation) {
      ... // Drop the operation.
    }
    break;
  }
  ...
}
```

Pros and cons:
+ The new operation won't go into the operation application cycle since it
is
  rejected in the very beginning. This means no resource metadata is
touched.
- Explicit slave version checks at master side make the code look not very
clean,
  and we will need to update this list every time we add a new operation.

2) Treat this issue as an agent crash bug. The Mesos master would forward
the operation to the agent, regardless of the agent's Mesos version. In the
agent,
we deploy and backport the following logic in `Slave::applyOperation` [4]:

```
if (message.operation_info().type() == OPERATION_UNKNOWN) {
  ... // Drop the operation and trigger a re-registration or send an
      // `UpdateSlaveMessage` to force the master to update the total
resource of
      // the slave.
}
```

Pros and cons:
+ Easier to add new operations since no new logic needs to be added for
backward
  Compability.
- Since the old agent won't know whether the new operations are speculative
or not,
  a re-registration or an `UpdateSlaveMessage` is required.
- Mesos 1.5.0 agents will still have the bug and crash when a new master
sends a
  new operation to them.

Since both options are viable and there seems to be no clear winner, we'd
like to
check with the community to see which convention is preferable. Please let
us know
what you think. Thanks!

Best,
Chun-Hung


[1] https://issues.apache.org/jira/browse/MESOS-4965
[2]
https://github.com/apache/mesos/blob/1.5.x/src/common/protobuf_utils.cpp#L851
[3] https://github.com/apache/mesos/blob/master/src/master/master.cpp#L3899
[4] https://github.com/apache/mesos/blob/1.5.x/src/slave/slave.cpp#L4359

Reply via email to