Re: Review Request 35433: Sent StatusUpdates if checkpointed resources don't exist on the slave.

Ben Mahler Fri, 19 Jun 2015 15:44:34 -0700


> On June 14, 2015, 10:46 a.m., Benjamin Hindman wrote:
> > Just so I understand, does this mean if we happen to get in the unfortunate 
> > situation where a slave has neglected to get the dynamic reservation 
> > because it was just starting up and then it gets the task launch it will 
> > shut down the slave because the CHECK will fail? I would expect the slave to 
> > simply send a TASK_LOST. Said another way, this is not an assertion our 
> > code guarantees. If instead we were waiting for some kind of an ack from 
> > the slave that it received the dynamic reservation before sending the task 
> > launch then a CHECK would make sense.
> 
> Jie Yu wrote:
>     We don't expect this to happen because we always send a 
> CheckpointResourcesMessage before sending the task to the slave and TCP 
> ensures in-order delivery (out-of-order delivery is possible if two sockets 
> are used, which can happen because of the way we create ephemeral connections, 
> but this is very unlikely). The master won't send the task to the slave if 
> the slave hasn't registered.
>     
>     I would rather keep the CHECK here unless we find that this is a real 
> issue (at which point we can change it to send a status update).
> 
> Michael Park wrote:
>     So currently it is possible for this to happen, but only with a very 
> small probability. Your proposal is to keep the `CHECK` and put in the effort 
> to eliminate the possibility once we observe it as a real problem, correct? 
> The part that I don't quite understand is, what's the motivation to wait for 
> a real problem to occur when we know it's possible to run into this issue 
> (even with a small probability) and the effort to change the `CHECK` to 
> sending `TASK_LOST` seems to be small?
> 
> Jie Yu wrote:
>     Well, everything has some probability of failing; the question is how 
> large that probability is. Memory could have hardware errors and a bit could 
> be flipped for random reasons. Does that mean we have to do a parity check 
> in every single location in our code base? My point is that the probability 
> of this failing is low enough that we shouldn't worry too much.
>     
>     I am fine with sending a status update.
> 
> Alexander Rukletsov wrote:
>     I wonder, what are the cases when the task launch request may arrive 
> before `CheckpointResourcesMessage`? If my understanding is correct, we do 
> not have a delivery guarantee for `CheckpointResourcesMessage`, nor do we have the 
> same queue in Master for `CheckpointResourcesMessage` and `RunTaskMessage` to 
> ensure the order. My intuition is that the probability of such an event is 
> not negligible: a network blip can occur and `CheckpointResourcesMessage` may 
> be lost or delayed, and we can open another socket to the slave for 
> `RunTaskMessage`. Could you please help me understand that?

Such a situation will manifest as an `Exited` event from the socket closure. At 
the application level, we want to ensure that if there are any `Exited` events, 
the slave (or framework) will re-register. This is not fully implemented yet: 
only the master-side `Exited` is handled (we ping the slave telling it we think 
it is disconnected), while the slave-side `Exited` is a no-op.
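
To make the intended behavior concrete, here is a rough sketch of what a 
slave-side `Exited` handler could look like if it forced a re-registration. 
This is not the actual slave code: `SlaveSketch`, `exited()`, `reregister()` 
and `canLaunch()` are hypothetical names, and the resource strings are made up 
for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified model. None of these names are the real Mesos
// or libprocess API.
enum class SlaveState { REGISTERED, DISCONNECTED };

struct SlaveSketch {
  SlaveState state = SlaveState::REGISTERED;
  std::vector<std::string> checkpointed;  // Resources learned from the master.
  int reregistrations = 0;

  // Assumed slave-side handler for a broken link to the master. Instead of
  // being a no-op, it marks the slave as disconnected.
  void exited() { state = SlaveState::DISCONNECTED; }

  // Re-registration resyncs the checkpointed resources with the master's
  // view, so a lost CheckpointResourcesMessage is recovered here.
  void reregister(const std::vector<std::string>& masterView) {
    checkpointed = masterView;
    state = SlaveState::REGISTERED;
    ++reregistrations;
  }

  // A task is accepted only while registered and only if its resource is
  // already known to the slave.
  bool canLaunch(const std::string& resource) const {
    if (state != SlaveState::REGISTERED) {
      return false;
    }
    for (const std::string& r : checkpointed) {
      if (r == resource) {
        return true;
      }
    }
    return false;
  }
};
```

In this model, a slave that sees `exited()` refuses launches until 
`reregister()` has resynced its resources, so a task cannot race ahead of a 
lost checkpoint message.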

It may become simpler with the HTTP API since we have a single duplex socket 
(the master does not initiate a connection with the slave (or framework)). This 
means that the responsibility of dealing with a closed socket is left to the 
slave (or framework) only. Off the top of my head, I'm not sure if there are 
situations where only one side of the socket can be broken, so maybe it will be 
just as complicated :)

Let's discuss this off this thread; I have some tickets around this.

- Ben

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35433/#review87857
-----------------------------------------------------------

On June 19, 2015, 2:31 p.m., Michael Park wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35433/
> -----------------------------------------------------------
> 
> (Updated June 19, 2015, 2:31 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Benjamin Hindman, and Jie Yu.
> 
> 
> Bugs: MESOS-2491
>     https://issues.apache.org/jira/browse/MESOS-2491
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> No bug was observed (yet), but I realized I forgot about this in the dynamic 
> reservations patches.
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto 8df1211165169c9595e0e6e85b5ddc404345ff70 
>   src/slave/slave.cpp a5ad29f59fadba919ed82ba2892c2febe551660b 
> 
> Diff: https://reviews.apache.org/r/35433/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Michael Park
> 
>
