[ https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968051#comment-15968051 ]

Neil Conway commented on MESOS-7389:
------------------------------------

Interesting. Basic logic here:

* Agent is re-registering with the master
* The agent reports a list of the tasks it is running, and the frameworks that 
are running tasks on it
* The assertion fires because there is a task running on the agent with a 
framework ID that is not in the list of frameworks the agent reported 
(roughly the invariant sketched below).
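
To make that concrete, here is a minimal, hedged sketch of the consistency 
check the master effectively performs on the re-registration data. This is 
not the actual master.cpp code; the function and field names are illustrative:

    #include <cassert>
    #include <string>
    #include <unordered_set>
    #include <vector>

    struct Task { std::string framework_id; };

    // frameworkIds: framework IDs the agent reported as having tasks on it.
    // tasks:        the tasks the agent reported it is running.
    void checkReregistration(const std::vector<std::string>& frameworkIds,
                             const std::vector<Task>& tasks)
    {
      const std::unordered_set<std::string> frameworks_(
          frameworkIds.begin(), frameworkIds.end());

      for (const Task& task : tasks) {
        // The CHECK from the stack trace expresses this invariant: every
        // task the agent reports must belong to a framework the agent
        // also reported.
        assert(frameworks_.count(task.framework_id) > 0);
      }
    }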

Pre-1.0 Mesos agents _only_ report the tasks they are running, not the list of 
frameworks. Connecting pre-1.0 Mesos agents to a 1.2.0 Mesos master is not 
_technically_ supported, but we don't actually guard against it just yet. So if 
the Mesos agent was actually running some pre-1.0 version of Mesos, that would 
explain the problem.
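
Continuing the illustrative sketch above: a pre-1.0 agent would effectively 
hand the master a non-empty task list with an empty framework list, so the 
very first task trips the check (the framework ID below is made up):

    // Empty frameworks list, one running task: the assert above fires
    // immediately for "framework-abc".
    checkReregistration(/* frameworkIds = */ {},
                        /* tasks = */ {{"framework-abc"}});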

If the agent was in fact running Mesos 1.0.1, something else is going on here.

[~nicholasstudt] -- can you confirm that the agent in question was definitely 
running Mesos 1.0.1 when the problem was observed?

> Check failed: frameworks_.contains(task.framework_id())
> -------------------------------------------------------
>
>                 Key: MESOS-7389
>                 URL: https://issues.apache.org/jira/browse/MESOS-7389
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.2.0
>         Environment: Ubuntu 14.04 
>            Reporter: Nicholas Studt
>
> During upgrade from 1.0.1 to 1.2.0, a single mesos-slave re-registering with 
> the running leader caused the leader to terminate. All 3 of the masters 
> suffered the same failure as the same slave node re-registered against the new 
> leader; this continued across the entire cluster until the offending slave 
> node was removed and fixed. The fix to the slave node was to remove the mesos 
> directory and then start the slave node back up. 
>  F0412 17:24:42.736600  6317 master.cpp:5701] Check failed: 
> frameworks_.contains(task.framework_id())
>  *** Check failure stack trace: ***
>      @     0x7f59f944f94d  google::LogMessage::Fail()
>      @     0x7f59f945177d  google::LogMessage::SendToLog()
>      @     0x7f59f944f53c  google::LogMessage::Flush()
>      @     0x7f59f9452079  google::LogMessageFatal::~LogMessageFatal()
>  I0412 17:24:42.750300  6316 replica.cpp:693] Replica received learned notice 
> for position 6896 from @0.0.0.0:0 
>      @     0x7f59f88f2341  mesos::internal::master::Master::_reregisterSlave()
>      @     0x7f59f88f488f  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERKSt6vectorINS5_8ResourceESaISG_EERKSF_INS5_12ExecutorInfoESaISL_EERKSF_INS5_4TaskESaISQ_EERKSF_INS5_13FrameworkInfoESaISV_EERKSF_INS6_17Archive_FrameworkESaIS10_EERKSsRKSF_INS5_20SlaveInfo_CapabilityESaIS17_EERKNS0_6FutureIbEES9_SC_SI_SN_SS_SX_S12_SsS19_S1D_EEvRKNS0_3PIDIT_EEMS1H_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_T19_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
>      @     0x7f59f93c3eb1  process::ProcessManager::resume()
>      @     0x7f59f93ccd57  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
>      @     0x7f59f77cfa60  (unknown)
>      @     0x7f59f6fec184  start_thread
>      @     0x7f59f6d19bed  (unknown)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)