[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks

Vinod Kone (JIRA) Fri, 26 Aug 2016 15:05:38 -0700

    [ 
https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15439996#comment-15439996
 ]


Vinod Kone commented on MESOS-4049:
-----------------------------------

Author: Neil Conway <neil.con...@gmail.com>
Date:   Fri Aug 26 14:48:47 2016 -0700

    Made a few minor tweaks to comments.
    
    Review: https://reviews.apache.org/r/50704/

commit 0b90cccaca0069a2e2fff54d1424d205659346a3
Author: Neil Conway <neil.con...@gmail.com>
Date:   Fri Aug 26 14:48:39 2016 -0700

    Removed a no-longer-relevant test.
    
    The behavior this test is trying to validate (slaves receive a
    `ShutdownMessage` if they attempt to reregister after failing health
    checks) will be changed shortly. Moreover, the new behavior is already
    covered by other test cases.
    
    Review: https://reviews.apache.org/r/50703/

commit 93016d37bf8833d7a78ada9c4ec59a374419ba35
Author: Neil Conway <neil.con...@gmail.com>
Date:   Fri Aug 26 14:48:16 2016 -0700

    Renamed metrics from "slave_shutdowns" to "slave_unreachable".
    
    The master will shortly be changed to no longer shut down unhealthy
    agents, so the previous metric name is no longer accurate. The old
    metric names have been kept for backwards compatibility, but they
    are no longer updated (i.e., they will always be set to zero).
    
    Review: https://reviews.apache.org/r/50702/

commit af496f3a80da9a8e7961fb62f839aacf1658222e
Author: Neil Conway <neil.con...@gmail.com>
Date:   Fri Aug 26 14:48:07 2016 -0700

    Added registrar operations for marking agents (un-)reachable.
    
    Review: https://reviews.apache.org/r/50701/

commit 540591407729ae9eaf81f68cb025b181782c5b99
Author: Neil Conway <neil.con...@gmail.com>
Date:   Fri Aug 26 14:48:03 2016 -0700

    Added a list of "unreachable" agents to the registry.
    
    These are agents that have failed health checks.
    
    Review: https://reviews.apache.org/r/50700/
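
As a rough illustration of the two registry commits above (the "unreachable" list and the operations for marking agents reachable or unreachable), here is a minimal Python sketch. It is not the Mesos registrar code. The `Registry` class, its field names, and the choice to move an agent out of the admitted set when it becomes unreachable are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Hypothetical in-memory stand-in for the master's registry."""

    # Agents currently considered admitted/reachable (assumed structure).
    admitted: set = field(default_factory=set)
    # Agents that have failed health checks, mapped to the time they were
    # marked unreachable (the timestamp is an assumption of this sketch).
    unreachable: dict = field(default_factory=dict)

    def mark_unreachable(self, agent_id, when):
        """Move an admitted agent to the unreachable list."""
        if agent_id not in self.admitted:
            return False  # Unknown or already unreachable: no change.
        self.admitted.remove(agent_id)
        self.unreachable[agent_id] = when
        return True

    def mark_reachable(self, agent_id):
        """Move an unreachable agent back to the admitted set."""
        if agent_id not in self.unreachable:
            return False  # Not in the unreachable list: no change.
        del self.unreachable[agent_id]
        self.admitted.add(agent_id)
        return True
```

Each operation returns whether it changed the registry, so a caller could skip persisting a no-op update.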

commit c3268cad3621a6373ff331d882393b2ada064f4b
Author: Neil Conway <neil.con...@gmail.com>
Date:   Fri Aug 26 14:47:53 2016 -0700

    Added new TaskState values and PARTITION_AWARE capability.
    
    The new TaskState values are TASK_DROPPED, TASK_UNREACHABLE, TASK_GONE,
    TASK_GONE_BY_OPERATOR, and TASK_UNKNOWN. They are intended to replace
    the existing TASK_LOST state by offering more fine-grained information
    on the current state of a task. These states will only be sent to
    frameworks that opt into this new behavior via the PARTITION_AWARE
    capability.
    
    Note that this commit doesn't add a master metric for the TASK_UNKNOWN
    status, because this is a "default" status reported when the master has
    no knowledge of a particular task/agent ID. Hence the number of
    "unknown" tasks at any given time is not a well-defined metric.
    
    Review: https://reviews.apache.org/r/50699/
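
To make the opt-in behavior described in the commit above concrete, here is a minimal Python sketch of how a status could be chosen per framework. It is not the Mesos implementation. The function name `state_for_framework` is hypothetical, and the fallback to TASK_LOST for frameworks without the capability is an assumption based on the statement that the new states are meant to replace TASK_LOST.

```python
from enum import Enum


class TaskState(Enum):
    TASK_LOST = "TASK_LOST"
    TASK_DROPPED = "TASK_DROPPED"
    TASK_UNREACHABLE = "TASK_UNREACHABLE"
    TASK_GONE = "TASK_GONE"
    TASK_GONE_BY_OPERATOR = "TASK_GONE_BY_OPERATOR"
    TASK_UNKNOWN = "TASK_UNKNOWN"


PARTITION_AWARE = "PARTITION_AWARE"

# The five fine-grained states named in the commit message.
FINE_GRAINED_STATES = frozenset({
    TaskState.TASK_DROPPED,
    TaskState.TASK_UNREACHABLE,
    TaskState.TASK_GONE,
    TaskState.TASK_GONE_BY_OPERATOR,
    TaskState.TASK_UNKNOWN,
})


def state_for_framework(state, capabilities):
    """Pick the state to report to a framework (hypothetical helper).

    Frameworks that did not opt in via PARTITION_AWARE are assumed to
    receive TASK_LOST in place of any fine-grained state.
    """
    if state in FINE_GRAINED_STATES and PARTITION_AWARE not in capabilities:
        return TaskState.TASK_LOST
    return state
```

For example, `state_for_framework(TaskState.TASK_UNREACHABLE, {PARTITION_AWARE})` keeps the fine-grained state, while the same call with an empty capability set falls back to `TaskState.TASK_LOST`.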


> Allow user to control behavior of partitioned agents/tasks
> ----------------------------------------------------------
>
>                 Key: MESOS-4049
>                 URL: https://issues.apache.org/jira/browse/MESOS-4049
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master, slave
>            Reporter: Neil Conway
>            Assignee: Neil Conway
>              Labels: mesosphere
>
> At present, if an agent is partitioned away from the master, the master waits 
> for a period of time (see MESOS-4048) before deciding that the agent is dead. 
> Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the 
> tasks running on the agent, and instructs the agent to shut down.
> Although this behavior is desirable for some/many users, it is not ideal for 
> everyone. For example:
> * Some users might want to aggressively start a new replacement task (e.g., 
> after one or two ping timeouts are missed); then when the old copy of the 
> task comes back, they might want to make an intelligent decision about how to 
> reconcile this situation (e.g., kill old, kill new, allow both to continue 
> running).
> * Some frameworks might want different behavior from other frameworks, or to 
> treat some tasks differently from other tasks. For example, if a task has a 
> huge amount of state that would need to be regenerated to spin up another 
> instance, the user might want to wait longer before starting a new task to 
> increase the chance that the old task will reappear.
> To do this, we'd need to change task state so that a task can go from 
> {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from 
> that state back to {{RUNNING}} (or perhaps we could keep the current 
> "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} 
> could also transition to {{LOST}}). The agent would also keep its old 
> {{slaveId}} when it reconnects.
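
The state transitions proposed in the description above can be sketched as a small transition table. This is only an illustration of the proposal as worded in the issue (RUNNING to UNKNOWN, then back to RUNNING or on to LOST). The names `ALLOWED_TRANSITIONS` and `transition` are hypothetical, and the sketch deliberately leaves out every other task state.

```python
# Transitions described in the proposal; LOST is treated as terminal here,
# which is an assumption of this sketch.
ALLOWED_TRANSITIONS = {
    "RUNNING": {"UNKNOWN"},
    "UNKNOWN": {"RUNNING", "LOST"},
    "LOST": set(),
}


def transition(current, target):
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

A partitioned task would move with `transition("RUNNING", "UNKNOWN")`, and then either return with `transition("UNKNOWN", "RUNNING")` when the agent reconnects or end with `transition("UNKNOWN", "LOST")` if the mark-lost-after-timeout option is kept.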



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
