> On Oct. 16, 2017, 6:33 p.m., Greg Mann wrote: > > src/messages/messages.proto > > Lines 637-638 (patched) > > <https://reviews.apache.org/r/63001/diff/1-2/?file=1856406#file1856406line637> > > > > I think that this list should include pending operations, but should > > _not_ include terminal operations with unacknowledged updates. > > > > When the master receives an update, it needs to decide whether or not > > it should update the allocator. If the operation was pending when > > `UpdateSlaveMessage` was sent, then the master should update the allocator. > > If the operation was not pending but its update had not been acknowledged, > > then the result of that operation was already reflected in the last > > `UpdateSlaveMessage` from the agent, and the master should not update the > > allocator. > > > > If the agent includes only pending operations in this field, then the > > master will be able to make this distinction.
The reason we need to send terminal but unacked operations is for reconciliation. Master always keeps a cache about all the known non-completed operations (either pending or terminal but unacked). This information can be used to answer reconciliation requests. The master can distinguish between a pending operation vs. a terminal but unacked operation by looking at latest state of the operation. The master update the allocator if the latest state if terminal, irrespective whether it's acked or not. That does bring up an issue. Consider the following two cases: 1) Master knows about an operation that the agent does not know about. This can happen if the operation gets lost on its way to the agent. This will trigger an agent re-registration. The master in this case will need to generate a reconcile message (similar to that in `reconcileKnownSlave`) so that the agent or LRP can reply with a terminal status update. This will guarantee that the resources allocated for this operation are properly recovered to allocator. This case is also possible if there is a speculation failure on old operations and result in an `UpdateSlaveMessage` being sent to the master. The operation that master knows about (but agent does not) will eventually reach agent or LRP, and will be rejected by the LRP or the agent. The master will properly recovered the resources to the allocator when a failed update is received. 2) Agent knows about an operation that the master does not know about. For instance, This is a new master that just fails over, and an LRP re-registers with the agent. In that case, the master needs to update the 'allocated' (i.e., 'used') resources for those new operations. This is similar to what we do when slave registers. However, the tricky part is that the allocator does not have an interface currently allowing us to add more allocation to a given agent. We likely need to add that (or make it part of `Allocator::updateSlave` interface). - Jie ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63001/#review188162 ----------------------------------------------------------- On Oct. 16, 2017, 6:32 p.m., Jie Yu wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/63001/ > ----------------------------------------------------------- > > (Updated Oct. 16, 2017, 6:32 p.m.) > > > Review request for mesos, Benjamin Bannier, Benjamin Mahler, Gaston Kleiman, > Greg Mann, Jan Schlicht, and Vinod Kone. > > > Repository: mesos > > > Description > ------- > > Updated protobuf definitions related to offer operations. > > > Diffs > ----- > > include/mesos/mesos.proto ba87339dbe341f4d16ceea74adc09647a3c07f32 > include/mesos/resource_provider/resource_provider.proto > f5a9073075327019fd133bd51265f695ef464845 > include/mesos/v1/mesos.proto a6d662fb26aa4f78ef20ffe6e013f7a45f7f8c21 > include/mesos/v1/resource_provider/resource_provider.proto > e5cbede5b6e57a8641fca1ebfee5454f292cc24c > src/messages/messages.proto 0a32b3457e9143a7d48670610ca3e56dd516136f > src/resource_provider/manager.cpp 31fcb789f5ab907511e868c374c49f7457a33ed3 > src/resource_provider/validation.cpp > d2927227f60ab0d4ae2481ad73a31ee444b48ee0 > src/tests/resource_provider_validation_tests.cpp > f182bff4670318e9de22c2915c5dbb423a74ad6c > > > Diff: https://reviews.apache.org/r/63001/diff/3/ > > > Testing > ------- > > make check > > > Thanks, > > Jie Yu > >