[ 
https://issues.apache.org/jira/browse/MESOS-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616123#comment-15616123
 ] 

Yan Xu commented on MESOS-6483:
-------------------------------

Backported to 1.1.x.

{noformat:title=}
commit b18c5ccdbfcfea133fe366c82dc0578c948134b9
Author: Neil Conway <[email protected]>
Date:   Thu Oct 27 14:16:01 2016 -0700

    Avoided CHECK failure with pre-1.0 agents.
    
    We don't guarantee compatibility with pre-1.0 agents. However, since it
    is easy to avoid a CHECK failure in the master when an old agent
    re-registers, it seems worth doing so.
    
    Review: https://reviews.apache.org/r/53202/
{noformat}

> Check failure when a 1.1 master marking a 0.28 agent as unreachable
> -------------------------------------------------------------------
>
>                 Key: MESOS-6483
>                 URL: https://issues.apache.org/jira/browse/MESOS-6483
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Megha
>            Assignee: Neil Conway
>             Fix For: 1.1.0, 1.2.0
>
>
> When upgrading directly from Mesos 0.28 to a version later than 1.0, the 
> CHECK(frameworks.recovered.contains(frameworkId)) in 
> Master::_markUnreachable(..) can fail. The following sequence of events 
> triggers it:
> 1) The master is upgraded to the new version first, while an agent, say X, 
> is still at Mesos 0.28.
> 2) Agent X (at Mesos 0.28) re-registers with the master (at, say, 1.1). It 
> does not send the frameworks (frameworkInfos) in the ReRegisterSlave 
> message, because that field was not available in the older Mesos version.
> 3) Among the frameworks on agent X is a framework Y that didn’t 
> re-register after the master’s failover. Because the master builds 
> frameworks.recovered from the frameworkInfos that agents provide, 
> framework Y is in neither the recovered nor the registered frameworks.
> 4) After re-registering, agent X fails the master’s health check and is 
> marked unreachable by the master. The 
> CHECK(frameworks.recovered.contains(frameworkId)) then fires for 
> framework Y, since it is neither recovered nor registered but has tasks 
> running on agent X.
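
The sequence above can be sketched with a small, self-contained model. This is not the Mesos master code and not the patch from the review linked above: MasterState, AgentReregistration, reregisterAgent and unknownFrameworks are hypothetical names introduced only for illustration, and framework IDs are modeled as plain strings. The sketch assumes the master learns about recovered frameworks only from the frameworkInfos an agent sends, as described in step 3, and shows one possible way to detect the problematic frameworks instead of asserting on them.

```cpp
// Simplified model of the scenario in MESOS-6483. All types and functions
// below are hypothetical; they do not mirror the real Mesos master classes.
#include <cassert>
#include <set>
#include <string>
#include <vector>

// The two framework sets the issue description refers to.
struct MasterState {
  std::set<std::string> registered;  // frameworks that re-registered
  std::set<std::string> recovered;   // frameworks learned from agents
};

// What an agent contributes when it re-registers.
struct AgentReregistration {
  // Empty for a pre-1.0 agent, which does not send frameworkInfos.
  std::vector<std::string> frameworkInfos;
  // Frameworks that have tasks running on this agent.
  std::vector<std::string> frameworksWithTasks;
};

// Step 2/3: the master adds to `recovered` only what the agent reports.
void reregisterAgent(MasterState& master, const AgentReregistration& agent) {
  for (const std::string& id : agent.frameworkInfos) {
    if (master.registered.count(id) == 0) {
      master.recovered.insert(id);
    }
  }
}

// Step 4: frameworks with tasks on the agent that the master knows nothing
// about. A strict CHECK would abort if this list were non-empty; a tolerant
// master could log and skip these frameworks instead.
std::vector<std::string> unknownFrameworks(
    const MasterState& master, const AgentReregistration& agent) {
  std::vector<std::string> unknown;
  for (const std::string& id : agent.frameworksWithTasks) {
    if (master.registered.count(id) == 0 &&
        master.recovered.count(id) == 0) {
      unknown.push_back(id);
    }
  }
  return unknown;
}
```

In this model, a 0.28-style agent that reports no frameworkInfos but runs tasks for frameworks Y and Z, where only Z has re-registered, leaves Y in the unknown list. An agent that does report Y in frameworkInfos leaves the list empty, so the strict check would hold.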



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
