> On June 14, 2015, 10:46 a.m., Benjamin Hindman wrote:
> > Just so I understand, does this mean if we happen to get in the unfortunate
> > situation where a slave has neglected to get the dynamic reservation
> > because it was just starting up and then it gets the task launch, it will
> > shut down the slave because the CHECK will fail? I would expect the slave to
> > simply send a TASK_LOST. Said another way, this is not an assertion our
> > code guarantees. If instead we were waiting for some kind of an ack from
> > the slave that it received the dynamic reservation before sending the task
> > launch, then a CHECK would make sense.
>
> Jie Yu wrote:
>     We don't expect this to happen because we always send a
>     CheckpointResourcesMessage before sending the task to the slave, and TCP
>     ensures in-order delivery (out-of-order delivery is possible if two sockets
>     are used; that can happen because of the way we create ephemeral
>     connections, but it is very unlikely). Master won't send the task to the
>     slave if the slave hasn't registered.
>
>     I would rather keep the CHECK here unless we find that this is a real
>     issue (and then we can change that to send a status update).
>
> Michael Park wrote:
>     So currently it is possible for this to happen, but only with a very
>     small probability. Your proposal is to keep the `CHECK` and put in the effort
>     to eliminate the possibility once we observe it as a real problem, correct?
>     The part that I don't quite understand is: what's the motivation to wait for
>     a real problem to occur when we know it's possible to run into this issue
>     (even with a small probability) and the effort to change the `CHECK` to
>     sending `TASK_LOST` seems to be small?
>
> Jie Yu wrote:
>     Well, everything has a probability to fail; the question is how large the
>     probability is. Memory could have hardware errors and a bit could be flipped
>     due to random reasons. Does that mean that we have to do a parity check in
>     every single location in our code base? I think my point is that the
>     probability for this to fail is extremely low, so we shouldn't worry too much.
>
>     I am fine with sending a status update.
>
> Alexander Rukletsov wrote:
>     I wonder, what are the cases when the task launch request may arrive
>     before `CheckpointResourcesMessage`? If my understanding is correct, we do
>     not have a delivery guarantee for `CheckpointResourcesMessage`, nor do we
>     have the same queue in Master for `CheckpointResourcesMessage` and
>     `RunTaskMessage` to ensure the order. My intuition is that the probability
>     of such an event is not negligible: a network blip can occur and
>     `CheckpointResourcesMessage` may be lost or delayed, and we can open another
>     socket to the slave for `RunTaskMessage`. Could you please help me
>     understand that?

Such a situation will manifest as an `Exited` event from the socket closure. At
the application level, we want to ensure that if there are any `Exited` events,
the slave (or framework) will re-register. This is not fully implemented yet:
currently only the master-side `Exited` is handled (we ping the slave telling it
we think it is disconnected); the slave-side `Exited` is a no-op.

It may become simpler with the HTTP API since we have a single duplex socket
(the master does not initiate a connection with the slave (or framework)). This
means that the responsibility of dealing with a closed socket is left to the
slave (or framework) only. Off the top of my head, I'm not sure if there are
situations where only one side of the socket can be broken, so maybe it will be
just as complicated :)

Let's discuss off this thread, I do have some tickets around this stuff.


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35433/#review87857
-----------------------------------------------------------


On June 19, 2015, 2:31 p.m., Michael Park wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35433/
> -----------------------------------------------------------
>
> (Updated June 19, 2015, 2:31 p.m.)
>
>
> Review request for mesos, Alexander Rukletsov, Benjamin Hindman, and Jie Yu.
>
>
> Bugs: MESOS-2491
>     https://issues.apache.org/jira/browse/MESOS-2491
>
>
> Repository: mesos
>
>
> Description
> -------
>
> No bug was observed (yet), but realized I forgot about this in the dynamic
> reservations patches.
>
>
> Diffs
> -----
>
>   include/mesos/mesos.proto 8df1211165169c9595e0e6e85b5ddc404345ff70
>   src/slave/slave.cpp a5ad29f59fadba919ed82ba2892c2febe551660b
>
> Diff: https://reviews.apache.org/r/35433/diff/
>
>
> Testing
> -------
>
> `make check`
>
>
> Thanks,
>
> Michael Park
>
>
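
For readers following the thread, the two options debated above (keeping the `CHECK` versus replying with `TASK_LOST`) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in `src/slave/slave.cpp`: `Resources`, `contains`, `launchWithCheck`, and `launchWithStatus` are hypothetical names, resources are reduced to a name-to-amount map, and a plain `assert` stands in for the `CHECK`.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins for illustration; none of these are the Mesos API.
enum class TaskState { TASK_RUNNING, TASK_LOST };

// Resource name -> reserved amount, e.g. {"cpus", 2.0}.
using Resources = std::map<std::string, double>;

// Returns true if every entry in `needed` is covered by `checkpointed`.
bool contains(const Resources& checkpointed, const Resources& needed) {
  for (const auto& entry : needed) {
    auto it = checkpointed.find(entry.first);
    if (it == checkpointed.end() || it->second < entry.second) {
      return false;
    }
  }
  return true;
}

// Option A: treat "checkpoint arrives before the task" as an invariant.
// If the ordering is ever violated, the process aborts.
TaskState launchWithCheck(const Resources& checkpointed,
                          const Resources& task) {
  assert(contains(checkpointed, task));  // stands in for a CHECK
  return TaskState::TASK_RUNNING;
}

// Option B: tolerate a lost or delayed checkpoint message by reporting
// the task as lost instead of aborting.
TaskState launchWithStatus(const Resources& checkpointed,
                           const Resources& task) {
  if (!contains(checkpointed, task)) {
    return TaskState::TASK_LOST;
  }
  return TaskState::TASK_RUNNING;
}
```

With option B, a task that asks for resources the slave has not yet checkpointed yields `TaskState::TASK_LOST`, whereas option A would terminate the process in the same situation. Which is appropriate depends on whether the ordering is something the code actually guarantees, which is the question raised in the review.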
