Re: Review Request 61473: Do not kill non partition aware tasks.

Jiang Yan Xu Fri, 29 Sep 2017 11:20:13 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review186615
-----------------------------------------------------------





src/master/http.cpp
Lines 324 (patched)
<https://reviews.apache.org/r/61473/#comment263308>

    Leave a blank line between these if blocks. Here and below.



src/master/http.cpp
Lines 350 (patched)
<https://reviews.apache.org/r/61473/#comment263309>

    Leave a blank line between these if blocks.



src/master/http.cpp
Line 4203 (original), 4220-4226 (patched)
<https://reviews.apache.org/r/61473/#comment263501>

    NPA tasks already have the state `TASK_LOST` right? Could you check that?
    
    The difference is that previously, if a framework was PA (and its 
unreachable tasks were stored in  `unreachableTasks` as `UNREACHABLE`) and 
later it changed to NPA again, the code here made no attempt to double check to 
change it to `TASK_LOST`. The new code would do it though.
    
    It's debatable which is more right but I'd say it's safe to maintain the 
existing behavior?



src/master/master.hpp
Lines 2475-2477 (patched)
<https://reviews.apache.org/r/61473/#comment263317>

    Because this method takes a pointer, this mutation could affect future uses 
of it. Even though right now nothing that cares about the state follows the 
call of `addUnreachableTask`, it may still be good to not propagate the change.
    
    How about doing it in `addUnreachableTask`?
    
    ```
    void addUnreachableTask(const Task& _task)
    {
      Task* task = new Task(_task);
    
      // We have to use TASK_LOST for non-partition-aware frameworks
      // for backwards compatibility.
      if (!capabilities.partitionAware) {
        task->set_state(TASK_LOST);
      }
    
      unreachableTasks.set(task.task_id(), process::Owned<Task>(task));
    }
    ```



src/master/master.hpp
Line 2844 (original), 2840 (patched)
<https://reviews.apache.org/r/61473/#comment263318>

    s/non-/Non-/ to be consistent with other instances of `NOTE`s.



src/master/master.cpp
Line 6448 (original), 6439 (patched)
<https://reviews.apache.org/r/61473/#comment263319>

    The decision is pretty simple right now: all tasks are added unless the 
framework is completed. So this separate introduction paragraph seems redudant 
and can be removed now:
    
    ```
    // Decide how to handle the tasks running on the agent:
    //
    ```



src/master/master.cpp
Line 6450 (original), 6441 (patched)
<https://reviews.apache.org/r/61473/#comment263320>

    "All tasks" except for those belonging to completed frameworks, so to 
summarize the block of code below, we should probably mention this.



src/master/master.cpp
Lines 6453-6456 (original), 6443-6446 (patched)
<https://reviews.apache.org/r/61473/#comment263321>

    "no further cleanup is required" is true for all  circumstances now so I 
think this comment is not important anymore.
    
    If we comment about the code below ignoring the history of how we got here 
but just as it is, we could probably just remove the whole sentence:
    
    ```
      // If the master has failed
      // over since the agent was marked unreachable then the master shouldn't
      // have any record of the tasks running on the agent, so no further
      // cleanup is required.
    ```



src/master/master.cpp
Lines 7188-7190 (original), 7144-7146 (patched)
<https://reviews.apache.org/r/61473/#comment263411>

    Our handling of `TASK_UNREACHABLE` vs. `TASK_LOST` here is a little 
different than elsewhere so I think this warrants a bit of explanation.
    
    e.g., 
    ```
    // Transition tasks to TASK_UNREACHABLE and remove (archive) them.
    // We convert the task state to TASK_LOST if is the framework is not 
partition aware.
    // However we only do the conversion right before the status update is sent 
out or the
    // task is archived because the processing prior to then requires tasks to 
be of the 
    // correct state TASK_UNREACHABLE.
    ```
    
    Does this sound right?



src/master/master.cpp
Lines 7169-7173 (patched)
<https://reviews.apache.org/r/61473/#comment263406>

    You have created this `newUpdate` but are not immediately using it, and I 
have to pay attention to notice that it is `update` instead of `newUpdate` that 
is passed to `updateTask()`.
    
    So perhaps defer this change to after `updateTask` and `removeTask` are 
called? At that point because the only remaining use of the `update` is to send 
it out, you don't even need to make a copy any more. Just change the field.



src/master/master.cpp
Lines 7171-7172 (patched)
<https://reviews.apache.org/r/61473/#comment263399>

    This could be shortened to one line.
    
    ```
    update.mutable_status()->set_state(TASK_LOST);
    ```



src/master/master.cpp
Line 7218 (original), 7174 (patched)
<https://reviews.apache.org/r/61473/#comment263402>

    Looking into the method `updateTask`, passing in `update` here will cause 
the operator API subscribers to receive `TASK_UNREACHABLE` for NPA task. So we 
should probably handle it inside.



src/master/master.cpp
Lines 8989-8990 (original), 8945-8946 (patched)
<https://reviews.apache.org/r/61473/#comment263403>

    This is going to send `TASK_UNREACHABLE` to the operator API subscribers 
even for NPA framework tasks. 
    
    We should probably be consistent and send `TASK_LOST`.


- Jiang Yan Xu


On Sept. 24, 2017, 3:46 p.m., Megha Sharma wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> -----------------------------------------------------------
> 
> (Updated Sept. 24, 2017, 3:46 p.m.)
> 
> 
> Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
>     https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Master will not kill the tasks for non-Partition aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessages to the agent
> to kill the tasks from non partition aware frameworks including the
> ones that are still registered which was problematic because the offer
> from this agent could still go to the same framework which could then
> launch new tasks. The agent would then receive tasks of the same
> framework and ignore them because it thinks the framework is shutting
> down. The framework is not shutting down of course, so from the master
> and the scheduler’s perspective the task is pending in STAGING forever
> until the next agent reregistration, which could happen much later.
> This commit fixes the problem by not shutting down the non-partition
> aware frameworks on such an agent.
> 
> 
> Diffs
> -----
> 
>   src/master/http.cpp 28d0393fb5962df4d731521265efd81a54e1e655 
>   src/master/master.hpp 05f88111afb4fa0e2baf57106e1479914c16a113 
>   src/master/master.cpp 6d84a26bff970b842b58dfb69dbf232ba5c16a20 
>   src/tests/partition_tests.cpp 0886f4890ac3fec6f38146946892769a99c3e68f 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/7/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

Re: Review Request 61473: Do not kill non partition aware tasks.

Reply via email to