> On Aug. 17, 2018, 10:29 p.m., Chun-Hung Hsiao wrote:
> > src/slave/slave.cpp
> > Lines 8195 (patched)
> > <https://reviews.apache.org/r/68147/diff/6/?file=2074578#file2074578line8195>
> >
> >     Let's validate that there is no task using the resources provided by 
> > this RP before doing the removal, or fail the call.
> 
> Benjamin Bannier wrote:
>     I believe this should be treated as orthogonal issue. It might e.g., 
> happen that the user triggering the RP removal cannot kill tasks or prevent 
> them from being rescheduled. I'd suggest to leaves as is.
>     
>     WDYT?
> 
> Chun-Hung Hsiao wrote:
>     The scenario I am worried about is that when the RP is marked as gone, 
> the RP may fail to tear itself down if there is any task using its resources.
> 
> Benjamin Bannier wrote:
>     We don't seem to have a good story on incompatible resource changes ATM. 
> If we e.g., update a RP config in an incompatible way, we make not attempt to 
> detect or rectify this in any way, either. At least in the agent we were in 
> general we were operating under the assumption that tasks would die on 
> incompatible changes to their resources.
>     
>     I'd suggest to drop this issue for now and potentially file a JIRA to 
> track draining and situations where it might be needed.
> 
> Benjamin Bannier wrote:
>     Dropping this, see above comment.
> 
> Chun-Hung Hsiao wrote:
>     Reopening this. Here's my suggestion:
>     
>     1. Create a ticket for a proper design to drain the resource provider.
>     2. For now, let's fail the call here if 
> `slave->resourceProviders.at(resourceProviderId)->totalResources` is not 
> empty. In other words, the operator must first reconfigure the resource 
> provider to "drain" itself before making the `MARK_RESOURCE_PROVIDER_GONE` 
> call.
> 
> Benjamin Bannier wrote:
>     Okay.
>     
>     1. I created https://issues.apache.org/jira/browse/MESOS-9522.
>     2. We had discussed this some time ago, and I think agreed that 
> preventing operators from marking resource providers as gone whenever a task 
> was running on them was not good (especially if operators cannot drain the 
> RP). We have prior art in `MARK_AGENT_GONE` which asks operators to drain 
> agent before marking as gone. I would suggest for now with MESOS-9522 not 
> implemented we allow operators to mark resource providers as gone even if 
> some of their resources are in use; if needed they can drain them by 
> filtering all tasks using their resources. This also wouldn't expose a race.
>     
>     Does this address your concerns?

I agreed that checking if there is any task running on resources of an RP when 
marking it as gone would lead to some complexity, so I'm suggesting to use a 
stronger condition: there must be no resource.

I'm not convinced that asking the operator to drain the RP is a bad idea. My 
main concern is that, since `MARK_RESOURCE_PROVIDER_GONE` cannot be undone, we 
should only perform it when it is safe to do, so the operator can have a chance 
to fix the problem. Also, this restriction would ensure that no framework will 
have an inconsistent view of cluster resources. WDYT?


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68147/#review207559
-----------------------------------------------------------


On Jan. 15, 2019, 3:03 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68147/
> -----------------------------------------------------------
> 
> (Updated Jan. 15, 2019, 3:03 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao, Gastón Kleiman, Greg Mann, and Jan 
> Schlicht.
> 
> 
> Bugs: MESOS-8403
>     https://issues.apache.org/jira/browse/MESOS-8403
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch adds support for triggering permanent removal of local
> resource providers. We also add authorization and tests as part of this
> patch.
> 
> 
> Diffs
> -----
> 
>   include/mesos/agent/agent.proto 74488e873cbf99ca487403b70691912cf3788288 
>   include/mesos/v1/agent/agent.proto 5d1ab6fd7f6f21593444467c66ed35ee90aea07f 
>   src/common/protobuf_utils.cpp a0159fed8d325808c5e8519da06173441debbbbb 
>   src/master/master.cpp 2339207149a85578ea47cf66f28392182f9075f2 
>   src/slave/http.hpp 5b113fa2f7e3421d4219e0ece113010937b204c1 
>   src/slave/http.cpp e0dc09103e9186aa3f7328d2052fc6747f5be9bb 
>   src/slave/slave.hpp 2eadf5fce9a314f1ec0ac5d51820c6381f5f1468 
>   src/slave/slave.cpp 10cbc190acc7eea5734efa0066541168545a66b6 
>   src/slave/validation.cpp df5e1373dbe497bc859455dcaf4e064e923bd72e 
>   src/tests/api_tests.cpp 79a2d2966ff6533629aeeefaaaf19e849b76e0e6 
> 
> 
> Diff: https://reviews.apache.org/r/68147/diff/13/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>

Reply via email to