Re: Review Request 68147: Added agent support to remove local resource providers.

Benjamin Bannier Wed, 31 Oct 2018 07:39:02 -0700


> On Aug. 18, 2018, 12:29 a.m., Chun-Hung Hsiao wrote:
> > src/slave/slave.cpp
> > Lines 7934 (patched)
> > <https://reviews.apache.org/r/68147/diff/6/?file=2074578#file2074578line7934>
> >
> >     `OPERATION_GONE_BY_OPERATOR` is not a terminal state:
> >     
> > https://github.com/apache/mesos/blob/808485da01387bd27a51cb82a90b1f8301d613ee/include/mesos/mesos.proto#L2342-L2349
> 
> Benjamin Bannier wrote:
>     From https://reviews.apache.org/r/68410/#comment_rc290945-207560,
>     
>     > ... It seems that OPERATOR_GONE_BY_OPERATOR does not carry the 
> semantics we want. Looking at the other possible error codes, it seems we 
> could either use OPERATION_ERROR or OPERATION_FAILED instead where 
> OPERATION_ERROR possibly better fits our requirements as the operation's side 
> effects might have materialized. That would mean we should drop this patch 
> completely.  
>     > WDYT?
> 
> Chun-Hung Hsiao wrote:
>     I'm not sure if either is a good choice:
>     
>     `OPERATION_ERROR` usually means some validation failure before applying 
> the operation, the consumed resource remains (i.e., can be used by other 
> operations), but the operation is non-retryable (i.e., the caller must fix 
> the error).
>     `OPERATION_FAILED` usually means something goes wrong after applying the 
> operation, the consumed resource remains, and the operation is retryable.
>     
>     In the case of mark RP gone, although we can argue that any operation 
> becomes an non-retryable one and we don't care about the consumed resoure 
> because it will then be removed. There still are some issues:
>     
>     1. The reason of the operation being non-retryable is not due to any 
> client-side error, but a server-side "error," and this seems inconsistent to 
> the current `OPERATION_ERROR` usage.
>     2. We transition an operation to `OPERATION_GONE_BY_OPERATOR` when 
> marking an agent as gone, but to `OPERATION_ERROR` when marking a local RP as 
> gone. This sounds inconsistent to me from an API perspective.
>     3. For `OPERATION_ERROR` and `OPERATION_FAILED`, Mesos provides reliable 
> retry (at least under "normal" circumstances), and expects acks. For 
> `OPERATION_GONE_BY_OPERATOR`, we could just do a best-effort notification 
> (the implementation is missing though) and don't need any ack, similar to 
> `TASK_GONE_BY_OPERATOR`. In the case of mark RP as gone, we probably want to 
> go with the later.
>     
>     So it seems better if we still go with `OPERATION_GONE_BY_OPERATOR` and 
> make the proper master changes. And we need to make sure that we call 
> `allocator->updateSlave` before calling `allocator->recoverResources` to 
> avoid the consumed resource being offered to other framework in between.


I added changes to use `OPERATION_GONE_BY_OPERATOR` here, and to treat it as 
terminal.


> On Aug. 18, 2018, 12:29 a.m., Chun-Hung Hsiao wrote:
> > src/slave/slave.cpp
> > Lines 8195 (patched)
> > <https://reviews.apache.org/r/68147/diff/6/?file=2074578#file2074578line8195>
> >
> >     Let's validate that there is no task using the resources provided by 
> > this RP before doing the removal, or fail the call.
> 
> Benjamin Bannier wrote:
>     I believe this should be treated as orthogonal issue. It might e.g., 
> happen that the user triggering the RP removal cannot kill tasks or prevent 
> them from being rescheduled. I'd suggest to leaves as is.
>     
>     WDYT?
> 
> Chun-Hung Hsiao wrote:
>     The scenario I am worried about is that when the RP is marked as gone, 
> the RP may fail to tear itself down if there is any task using its resources.
> 
> Benjamin Bannier wrote:
>     We don't seem to have a good story on incompatible resource changes ATM. 
> If we e.g., update a RP config in an incompatible way, we make not attempt to 
> detect or rectify this in any way, either. At least in the agent we were in 
> general we were operating under the assumption that tasks would die on 
> incompatible changes to their resources.
>     
>     I'd suggest to drop this issue for now and potentially file a JIRA to 
> track draining and situations where it might be needed.

Dropping this, see above comment.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68147/#review207559
-----------------------------------------------------------


On Oct. 25, 2018, 12:47 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68147/
> -----------------------------------------------------------
> 
> (Updated Oct. 25, 2018, 12:47 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao, Gastón Kleiman, Greg Mann, and Jan 
> Schlicht.
> 
> 
> Bugs: MESOS-8403
>     https://issues.apache.org/jira/browse/MESOS-8403
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch adds support for triggering permanent removal of local
> resource providers. We also add authorization and tests as part of this
> patch.
> 
> 
> Diffs
> -----
> 
>   include/mesos/agent/agent.proto 74488e873cbf99ca487403b70691912cf3788288 
>   include/mesos/v1/agent/agent.proto 5d1ab6fd7f6f21593444467c66ed35ee90aea07f 
>   src/common/protobuf_utils.cpp 2ae37ac610894652a0214022a0cf04c4365396dd 
>   src/master/master.cpp f88c7c1f03f0de7236aad9e3bf4bfac82e91bc65 
>   src/slave/http.hpp 5b113fa2f7e3421d4219e0ece113010937b204c1 
>   src/slave/http.cpp a4db532cc6477c386fcd9bf563895214e95a475a 
>   src/slave/slave.hpp 0bd340176e2a8cefdfa7ef71e059441fb171aff6 
>   src/slave/slave.cpp 74f6fb9036a9ac4f587f53ec2df04eeb4c167bfb 
>   src/slave/validation.cpp df5e1373dbe497bc859455dcaf4e064e923bd72e 
>   src/tests/api_tests.cpp fdd9f871f75617fc26a28679e2a1e41f506c6133 
> 
> 
> Diff: https://reviews.apache.org/r/68147/diff/10/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>

Re: Review Request 68147: Added agent support to remove local resource providers.

Reply via email to