[ 
https://issues.apache.org/jira/browse/MESOS-8750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463379#comment-16463379
 ] 

Yan Xu commented on MESOS-8750:
-------------------------------

{code:title=}
commit 520b729857223aeade345cbdf61209ec4f395ad9
Author: Megha Sharma <[email protected]>
Date:   Thu May 3 22:09:02 2018 -0700

    Remove unknown unreachable tasks when agent reregisters.
    
    A RunTaskMessage can be dropped while an agent is disconnected from
    the master. If that agent is then marked unreachable, the task from
    the dropped message is added to the framework's unreachableTasks.
    When the agent reregisters, the master sends status updates for the
    tasks the agent reported and removes those tasks from
    unreachableTasks. The agent does not know about the dropped task,
    so that task is never removed, which leads to a check failure when
    the inconsistency is detected during framework removal.
    
    Review: https://reviews.apache.org/r/66644/
{code}
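
The bookkeeping described in the commit message can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the Mesos implementation: the Master and Framework structs, their member functions, the removeUnknown flag and the IDs "S1", "t1" and "t2" are all hypothetical names chosen for illustration. Only unreachableTasks and the checked condition come from the issue.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical, simplified stand-ins for master state (not the Mesos types).
struct Framework {
  // Task ID -> ID of the agent the master believes the task runs on.
  std::unordered_map<std::string, std::string> unreachableTasks;
};

struct Master {
  std::unordered_set<std::string> registeredAgents;
  Framework framework;

  // The agent becomes unreachable: every task the master knows about for it,
  // including one whose launch message was dropped, is recorded as unreachable.
  void markUnreachable(const std::string& agentId,
                       const std::set<std::string>& tasksKnownToMaster) {
    registeredAgents.erase(agentId);
    for (const auto& taskId : tasksKnownToMaster) {
      framework.unreachableTasks[taskId] = agentId;
    }
  }

  // The agent reregisters and reports the tasks it knows about.
  // removeUnknown == false models the behaviour described in the issue;
  // removeUnknown == true also drops tasks of this agent it did not report.
  void reregister(const std::string& agentId,
                  const std::set<std::string>& tasksReportedByAgent,
                  bool removeUnknown) {
    registeredAgents.insert(agentId);
    auto& tasks = framework.unreachableTasks;
    for (auto it = tasks.begin(); it != tasks.end();) {
      const bool onThisAgent = it->second == agentId;
      const bool reported = tasksReportedByAgent.count(it->first) > 0;
      if (onThisAgent && (reported || removeUnknown)) {
        it = tasks.erase(it);
      } else {
        ++it;
      }
    }
  }

  // Mirrors the condition of the failing check: no unreachable task may
  // belong to an agent that is currently registered.
  bool unreachableTasksConsistent() const {
    for (const auto& entry : framework.unreachableTasks) {
      if (registeredAgents.count(entry.second) > 0) {
        return false;
      }
    }
    return true;
  }
};

// Scenario from the issue: the launch message for "t2" was dropped, so the
// agent only reports "t1" when it reregisters.
bool scenarioIsConsistent(bool removeUnknown) {
  Master master;
  master.registeredAgents.insert("S1");
  master.markUnreachable("S1", {"t1", "t2"});
  master.reregister("S1", {"t1"}, removeUnknown);
  return master.unreachableTasksConsistent();
}
```

In this model scenarioIsConsistent(false) returns false, because "t2" is still listed as unreachable on a registered agent, which is the state the check rejects during framework removal. scenarioIsConsistent(true) returns true.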

> Check failed: !slaves.registered.contains(task->slave_id)
> ---------------------------------------------------------
>
>                 Key: MESOS-8750
>                 URL: https://issues.apache.org/jira/browse/MESOS-8750
>             Project: Mesos
>          Issue Type: Task
>          Components: master
>    Affects Versions: 1.6.0
>            Reporter: Megha Sharma
>            Assignee: Megha Sharma
>            Priority: Critical
>
> In certain circumstances an unreachable task is not cleaned up from 
> {{framework.unreachableTasks}} when its agent re-registers, which leads to 
> this check failure later, when the framework is being removed. When an agent 
> goes unreachable, the master adds the tasks from this agent to 
> {{framework.unreachableTasks}}. When the agent re-registers, the master 
> removes from this data structure only the tasks that the agent reports during 
> re-registration. There can be tasks the agent does not know about, e.g. 
> because the runTask message for them was dropped, and such tasks are never 
> removed from {{unreachableTasks}}.
> {noformat}
> F0310 13:30:58.856665 62740 master.cpp:9671] Check failed: 
> !slaves.registered.contains(task->slave_id()) Unreachable task <taskID> of 
> framework 4f57975b-05dd-4118-8674-5b29a86c6a6c-0850 was found on registered 
> agent 683c4a92-b5a0-490c-998a-6113fc86d37a-S1428
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
