[ https://issues.apache.org/jira/browse/MESOS-6226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Yu updated MESOS-6226:
--------------------------
    Priority: Critical  (was: Major)

> Master crashes while transitioning tasks to 'TASK_UNREACHABLE'
> --------------------------------------------------------------
>
>                 Key: MESOS-6226
>                 URL: https://issues.apache.org/jira/browse/MESOS-6226
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.1.0
>            Reporter: Jan Schlicht
>            Assignee: Neil Conway
>            Priority: Critical
>             Fix For: 1.1.0
>
>
> While marking an agent as unreachable, the Mesos master crashes as it 
> transitions that agent's tasks to unreachable.
> A possible cause is that a task's framework has not yet reregistered with 
> the master.
> {noformat}
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.207105  
> 1043 master.cpp:236] Scheduling transition of agent 
> 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 to UNREACHABLE because of health 
> check timeout
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.217079  
> 1043 master.cpp:5849] Marking agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 
> at slave(1)@10.10.0.205:5051 (10.10.0.205) unreachable: health check timed out
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.217375  
> 1043 registrar.cpp:461] Applied 1 operations in 97324ns; attempting to update 
> the registry
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.217798  
> 1041 coordinator.cpp:348] Coordinator attempting to write APPEND action at 
> position 2681
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.218132  
> 1041 replica.cpp:537] Replica received write request for position 2681 from 
> __req_res__(35)@10.10.0.212:5050
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: E0922 06:16:16.218492  
> 1049 process.cpp:2113] Failed to shutdown socket with fd 74: Transport 
> endpoint is not connected
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.220237  
> 1041 replica.cpp:691] Replica received learned notice for position 2681 from 
> @0.0.0.0:0
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222190  
> 1041 registrar.cpp:506] Successfully updated the registry in 4.75008ms
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222304  
> 1043 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at 
> position 2682
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222390  
> 1043 master.cpp:5896] Marked agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 
> (10.10.0.205) unreachable: health check timed out
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222456  
> 1043 master.cpp:7537] Updating the state of task 8 of framework 
> 2505868e-c750-4aed-8413-99810a59ffd8-0001 (latest state: TASK_LOST, status 
> update state: TASK_LOST)
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222558  
> 1043 master.cpp:7633] Removing task 8 with resources cpus(*):0.2; mem(*):32 
> of framework 2505868e-c750-4aed-8413-99810a59ffd8-0001 on agent 
> 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 at slave(1)@10.10.0.205:5051 
> (10.10.0.205)
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222631  
> 1043 master.cpp:5695] Sending status update TASK_LOST for task 8 of framework 
> 2505868e-c750-4aed-8413-99810a59ffd8-0001 'Slave 10.10.0.205 is unreachable'
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222651  
> 1041 hierarchical.cpp:514] Removed agent 
> 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: F0922 06:16:16.222733  
> 1043 master.cpp:5918] Check failed: 'framework' Must be non NULL
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: *** Check failure stack 
> trace: ***
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222829  
> 1041 replica.cpp:537] Replica received write request for position 2682 from 
> __req_res__(38)@10.10.0.212:5050
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc2ba979d  
> google::LogMessage::Fail()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc2bab5cd  
> google::LogMessage::SendToLog()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.248742  
> 1041 replica.cpp:691] Replica received learned notice for position 2682 from 
> @0.0.0.0:0
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc2ba938c  
> google::LogMessage::Flush()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc2babec9  
> google::LogMessageFatal::~LogMessageFatal()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc21dd52d  
> google::CheckNotNull<>()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc21c632c  
> mesos::internal::master::Master::_markUnreachable()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc220c88b  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterEPNS7_5SlaveENS5_8TimeInfoERKNS0_6FutureIbEESA_SB_SD_EEvRKNS0_3PIDIT_EEMSH_FvT0_T1_T2_ET3_T4_T5_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc2b305d1  
> process::ProcessManager::resume()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc2b308d7  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc11e1220  
> (unknown)
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc09ffdc5  
> start_thread
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc072cced  
> __clone
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
