Ilya created MESOS-10209:
----------------------------

             Summary: Agent reregistration and marking race
                 Key: MESOS-10209
                 URL: https://issues.apache.org/jira/browse/MESOS-10209
             Project: Mesos
          Issue Type: Bug
          Components: master
    Affects Versions: 1.11.0
            Reporter: Ilya
            Assignee: Ilya


After a master failover, if an agent attempts to reregister while it is being 
marked as unreachable, and the reregistration finishes before the 
{{MarkUnreachable}} operation completes, the assertion in 
{{Master::_markUnreachable()}} [1] that the agent is in the {{recovered}} set 
fails. When readmitting the agent, the master removes it from the {{recovered}} 
set in {{Master::__reregisterSlave()}} [2]. If {{__reregisterSlave()}} runs 
before {{_markUnreachable()}}, the agent is no longer in the set when the 
assertion is evaluated, so the check fails and the master aborts.
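
The ordering problem can be illustrated with a small standalone model. This is 
only a sketch and not Mesos code: {{MasterModel}}, {{reregister()}}, 
{{markUnreachableInvariantHolds()}}, and {{raceBreaksInvariant()}} are 
hypothetical names, and the {{recovered}} set is reduced to a plain 
{{std::set}} of agent IDs. It shows only that a precondition checked in a 
deferred continuation can be invalidated by work that runs in between.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical, simplified stand-in for the master's agent bookkeeping.
struct MasterModel {
  // Agents recovered from the registry that have not yet reregistered.
  std::set<std::string> recovered;
  // Agents that have been readmitted.
  std::set<std::string> registered;

  // Models the readmission step: the agent leaves `recovered`.
  void reregister(const std::string& id) {
    recovered.erase(id);
    registered.insert(id);
  }

  // Models the precondition evaluated by the continuation that runs once
  // the registry operation has completed. Returns false in the situation
  // where a fatal check on `recovered` would abort the process.
  bool markUnreachableInvariantHolds(const std::string& id) const {
    return recovered.count(id) == 1;
  }
};

// Replays the interleaving from the report: marking starts, the agent
// reregisters, and only then does the marking continuation run.
bool raceBreaksInvariant() {
  const std::string id = "S1";  // Placeholder agent ID.
  MasterModel master;
  master.recovered.insert(id);

  // Marking has been started; the agent is still in `recovered`.
  assert(master.recovered.count(id) == 1);

  // Reregistration completes before the continuation runs.
  master.reregister(id);

  // The continuation now finds its precondition violated.
  return !master.markUnreachableInvariantHolds(id);
}
```

In this model {{raceBreaksInvariant()}} returns {{true}}. Without the 
intervening {{reregister()}} call the precondition still holds.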

Example:
{noformat}
I1215 02:10:02.657672 498611 master.cpp:2170] Elected as the leading master!
I1215 02:10:08.415233 498563 master.cpp:1819] Recovered ??? agents from the 
registry (???B); allowing 10mins for agents to reregister
I1215 02:20:08.128789 498569 master.cpp:2037] Scheduling removal of agent 
696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50); did not 
reregister within 10mins after master failover
I1215 02:20:16.480931 498596 master.cpp:9469] Marking agent 
696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50) 
unreachable: did not reregister within 10mins after master failover
I1215 02:20:16.864944 498560 master.cpp:7439] Received reregister agent message 
from agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at 
slave(1)@10.1.2.3:31478 (meta-slave-test-3-82-50)
I1215 02:20:16.865509 498560 master.cpp:7980] Re-registered agent 
696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at slave(1)@10.1.2.3:31478 
(meta-slave-test-3-82-50) with cpus:64; mem:32000; disk:320000; 
ports:[31000-32000]
I1215 02:20:16.869235 498553 master.cpp:8370] Received update of agent 
696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at slave(1)@10.1.2.3:31478 
(meta-slave-test-3-82-50) with total oversubscribed resources {}
I1215 02:20:16.869263 498553 master.cpp:8487] Ignoring update on agent 
696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at slave(1)@10.1.2.3:31478 
(meta-slave-test-3-82-50) as it reports no changes
I1215 02:20:16.869755 498605 hierarchical.cpp:854] Added agent 
696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50) with 
cpus:64; mem:32000; disk:320000; ports:[31000-32000] (allocated: {})
I1215 02:20:22.541494 498591 master.cpp:9512] Marked agent 
696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50) 
unreachable: did not reregister within 10mins after master failover
F1215 02:20:22.541508 498591 master.cpp:9523] Check failed: 
slaves.recovered.contains(slave.id())
*** Check failure stack trace: ***
    @     0x7fcda8a90fdd  google::LogMessage::Fail()
    @     0x7fcda8a93263  google::LogMessage::SendToLog()
    @     0x7fcda8a90b59  google::LogMessage::Flush()
    @     0x7fcda8a93c69  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fcda75d05d8  mesos::internal::master::Master::_markUnreachable()
    @     0x7fcda75d083d  (unknown)
    @     0x7fcda72b0f93  
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vEEEEESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEEEEEclEOS3_
    @     0x7fcda89f68f1  process::ProcessBase::consume()
    @     0x7fcda8a0f09b  process::ProcessManager::resume()
    @     0x7fcda8a15986  (unknown)
    @     0x7fcda45ce070  (unknown)
    @     0x7fcda4c33ea5  start_thread
    @     0x7fcda3d318dd  __clone
{noformat}

[1] https://github.com/apache/mesos/blob/1.11.0/src/master/master.cpp#L8698
[2] https://github.com/apache/mesos/blob/1.11.0/src/master/master.cpp#L7110



