[ https://issues.apache.org/jira/browse/MESOS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215386#comment-16215386 ]

Alexander Rukletsov edited comment on MESOS-7991 at 11/6/17 10:44 AM:
----------------------------------------------------------------------

This could happen if there is a master failover and the agent then re-registers 
twice 
(https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/slave/slave.cpp#L1629).
 The assumption stated in 
https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8070
 therefore does not seem correct, and the change at 
https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8073,
 introduced by review request https://reviews.apache.org/r/53897/ to match that 
comment, should be removed.
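
To make the failure mode concrete, here is a small standalone model of the invariant that the CHECK encodes. This is not the Mesos source; the types and the reconcileKnownSlave() flow are simplified from the stack trace and log below, purely for illustration:

{code}
// Toy model (NOT Mesos source): why CHECK(!framework->recovered()) in
// Master::reconcileKnownSlave() can fire after a duplicate agent
// re-registration while the framework has not yet re-registered.
#include <cassert>
#include <iostream>
#include <set>
#include <string>

struct Framework
{
  // After a master failover, a framework learned from the registry or
  // from re-registering agents stays "recovered" until it re-registers.
  bool reregistered = false;
  bool recovered() const { return !reregistered; }
};

struct Master
{
  Framework framework;
  std::set<std::string> knownTasks; // tasks the master believes exist

  // Called when an agent the master already knows re-registers, i.e.
  // on the *second* re-registration after a failover.
  void reconcileKnownSlave(const std::set<std::string>& agentTasks)
  {
    for (const std::string& task : knownTasks) {
      if (agentTasks.count(task) == 0) {
        std::cout << "Task " << task << " unknown to the agent during"
                  << " re-registration: reconciling with the agent\n";

        // The invariant assumed at master.cpp#L8070: a framework owning
        // such a task must have re-registered by now. After a duplicate
        // agent re-registration the framework may still be recovered,
        // so this aborts, like the CHECK failure in the log below.
        assert(!framework.recovered());
      }
    }
  }
};

int main()
{
  Master master;
  master.knownTasks.insert("862181ec-dffb-4c03-8807-5fb4c4e9a907");

  // A duplicate re-registration arrives while the framework has not yet
  // re-registered: the assert fires and the process aborts (compare
  // status=6/ABRT in the attached log).
  master.reconcileKnownSlave({});
  return 0;
}
{code}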

The strange part is that, according to the logs (master.cpp:7568), the tasks are 
known to the master but not to the agent; an agent keeping its ID while losing 
its tasks seems unlikely. [~drribosome] Could you give more context around the 
agent and its registration attempts, along with the master logs since the 
failover and the agent logs around that timeframe?

We should write a test reproducing the issue -(start a master and an agent, 
launch a task, restart the master, block framework re-registration, and let the 
agent re-register twice by spoofing the second re-registration)- and then remove 
line 8073.
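
A rough outline of such a test, assuming the helpers already used throughout src/tests/ (StartMaster/StartSlave, StandaloneMasterDetector, DROP_PROTOBUFS, FUTURE_PROTOBUF, process::post). The names are from memory and the framework/task boilerplate is elided, so treat this as a sketch rather than a ready-to-run test:

{code}
TEST_F(MasterTest, ReconcileKnownSlaveWithRecoveredFramework)
{
  Try<Owned<cluster::Master>> master = StartMaster();
  ASSERT_SOME(master);

  StandaloneMasterDetector detector(master.get()->pid);

  Try<Owned<cluster::Slave>> slave = StartSlave(&detector);
  ASSERT_SOME(slave);

  // Register a framework and launch a task on the agent (the usual
  // MockScheduler/driver boilerplate, elided here).

  // Simulate a master failover.
  master->reset();
  master = StartMaster();
  ASSERT_SOME(master);

  // Block framework re-registration so the framework stays in the
  // 'recovered' state on the new master.
  DROP_PROTOBUFS(ReregisterFrameworkMessage(), _, _);

  // Let the agent re-register with the new master and capture the
  // re-registration message.
  Future<ReregisterSlaveMessage> reregisterSlave =
    FUTURE_PROTOBUF(ReregisterSlaveMessage(), _, _);

  detector.appoint(master.get()->pid);
  AWAIT_READY(reregisterSlave);

  // Spoof a second re-registration by re-sending the captured message,
  // with the tasks stripped to mirror the "unknown to the agent" log
  // lines in this ticket. The master now knows the agent, so it takes
  // the reconcileKnownSlave() path while the framework is still
  // recovered; with line 8073 in place this should trip
  // 'Check failed: !framework->recovered()', and pass once removed.
  ReregisterSlaveMessage spoofed = reregisterSlave.get();
  spoofed.clear_tasks();

  process::post(slave.get()->pid, master.get()->pid, spoofed);
}
{code}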



> fatal, check failed !framework->recovered()
> -------------------------------------------
>
>                 Key: MESOS-7991
>                 URL: https://issues.apache.org/jira/browse/MESOS-7991
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Jack Crawford
>            Assignee: Armand Grillet
>            Priority: Blocker
>              Labels: reliability
>
> The Mesos master crashed during what appears to be framework recovery.
> mesos master version: 1.3.1
> mesos agent version: 1.3.1
> {code}
> W0920 14:58:54.756364 25452 master.cpp:7568] Task 862181ec-dffb-4c03-8807-5fb4c4e9a907 of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)@10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
> W0920 14:58:54.756369 25452 master.cpp:7568] Task 9c21c48a-63ad-4d58-9e22-f720af19a644 of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)@10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
> W0920 14:58:54.756376 25452 master.cpp:7568] Task 05c451f8-c48a-47bd-a235-0ceb9b3f8d0c of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)@10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
> W0920 14:58:54.756381 25452 master.cpp:7568] Task e8641b1f-f67f-42fe-821c-09e5a290fc60 of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)@10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
> W0920 14:58:54.756386 25452 master.cpp:7568] Task f838a03c-5cd4-47eb-8606-69b004d89808 of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)@10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
> W0920 14:58:54.756392 25452 master.cpp:7568] Task 685ca5da-fa24-494d-a806-06e03bbf00bd of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)@10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
> W0920 14:58:54.756397 25452 master.cpp:7568] Task 65ccf39b-5c46-4121-9fdd-21570e8068e6 of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)@10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
> F0920 14:58:54.756404 25452 master.cpp:7601] Check failed: !framework->recovered()
> *** Check failure stack trace: ***
>     @     0x7f7bf80087ed  google::LogMessage::Fail()
>     @     0x7f7bf800a5a0  google::LogMessage::SendToLog()
>     @     0x7f7bf80083d3  google::LogMessage::Flush()
>     @     0x7f7bf800afc9  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f7bf736fe7e  mesos::internal::master::Master::reconcileKnownSlave()
>     @     0x7f7bf739e612  mesos::internal::master::Master::_reregisterSlave()
>     @     0x7f7bf73a580e  _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERK6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEERKSt6vectorINS5_8ResourceESaISQ_EERKSP_INS5_12ExecutorInfoESaISV_EERKSP_INS5_4TaskESaIS10_EERKSP_INS5_13FrameworkInfoESaIS15_EERKSP_INS6_17Archive_FrameworkESaIS1A_EERKSL_RKSP_INS5_20SlaveInfo_CapabilityESaIS1H_EERKNS0_6FutureIbEES9_SC_SM_SS_SX_S12_S17_S1C_SL_S1J_S1N_EEvRKNS0_3PIDIT_EEMS1R_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_T10_ET11_T12_T13_T14_T15_T16_T17_T18_T19_T20_T21_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
>     @     0x7f7bf7f5e69c  process::ProcessBase::visit()
>     @     0x7f7bf7f71403  process::ProcessManager::resume()
>     @     0x7f7bf7f7c127  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
>     @     0x7f7bf60b5c80  (unknown)
>     @     0x7f7bf58c86ba  start_thread
>     @     0x7f7bf55fe3dd  (unknown)
> mesos-master.service: Main process exited, code=killed, status=6/ABRT
> mesos-master.service: Unit entered failed state.
> mesos-master.service: Failed with result 'signal'.
> {code}


