Jan Schlicht created MESOS-6226:
-----------------------------------

             Summary: Master crashes while transitioning tasks to 'TASK_UNREACHABLE'
                 Key: MESOS-6226
                 URL: https://issues.apache.org/jira/browse/MESOS-6226
             Project: Mesos
          Issue Type: Bug
          Components: master
    Affects Versions: 1.1.0
            Reporter: Jan Schlicht
            Assignee: Neil Conway


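The Mesos master crashes while marking an agent as unreachable, at the point 
where it transitions that agent's tasks to unreachable.
The fatal check is "'framework' Must be non NULL" in 
Master::_markUnreachable() (master.cpp:5918). A possible cause is that the 
framework owning one of the tasks has not re-registered with the master yet, 
so the framework lookup returns NULL.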

{noformat}
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.207105  1043 
master.cpp:236] Scheduling transition of agent 
5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 to UNREACHABLE because of health check 
timeout
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.217079  1043 
master.cpp:5849] Marking agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 at 
slave(1)@10.10.0.205:5051 (10.10.0.205) unreachable: health check timed out
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.217375  1043 
registrar.cpp:461] Applied 1 operations in 97324ns; attempting to update the 
registry
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.217798  1041 
coordinator.cpp:348] Coordinator attempting to write APPEND action at position 
2681
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.218132  1041 
replica.cpp:537] Replica received write request for position 2681 from 
__req_res__(35)@10.10.0.212:5050
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: E0922 06:16:16.218492  1049 
process.cpp:2113] Failed to shutdown socket with fd 74: Transport endpoint is 
not connected
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.220237  1041 
replica.cpp:691] Replica received learned notice for position 2681 from 
@0.0.0.0:0
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222190  1041 
registrar.cpp:506] Successfully updated the registry in 4.75008ms
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222304  1043 
coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at 
position 2682
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222390  1043 
master.cpp:5896] Marked agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 
(10.10.0.205) unreachable: health check timed out
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222456  1043 
master.cpp:7537] Updating the state of task 8 of framework 
2505868e-c750-4aed-8413-99810a59ffd8-0001 (latest state: TASK_LOST, status 
update state: TASK_LOST)
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222558  1043 
master.cpp:7633] Removing task 8 with resources cpus(*):0.2; mem(*):32 of 
framework 2505868e-c750-4aed-8413-99810a59ffd8-0001 on agent 
5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 at slave(1)@10.10.0.205:5051 
(10.10.0.205)
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222631  1043 
master.cpp:5695] Sending status update TASK_LOST for task 8 of framework 
2505868e-c750-4aed-8413-99810a59ffd8-0001 'Slave 10.10.0.205 is unreachable'
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222651  1041 
hierarchical.cpp:514] Removed agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: F0922 06:16:16.222733  1043 
master.cpp:5918] Check failed: 'framework' Must be non NULL
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: *** Check failure stack 
trace: ***
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222829  1041 
replica.cpp:537] Replica received write request for position 2682 from 
__req_res__(38)@10.10.0.212:5050
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc2ba979d  
google::LogMessage::Fail()
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc2bab5cd  
google::LogMessage::SendToLog()
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.248742  1041 
replica.cpp:691] Replica received learned notice for position 2682 from 
@0.0.0.0:0
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc2ba938c  
google::LogMessage::Flush()
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc2babec9  
google::LogMessageFatal::~LogMessageFatal()
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc21dd52d  
google::CheckNotNull<>()
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc21c632c  
mesos::internal::master::Master::_markUnreachable()
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc220c88b  
_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterEPNS7_5SlaveENS5_8TimeInfoERKNS0_6FutureIbEESA_SB_SD_EEvRKNS0_3PIDIT_EEMSH_FvT0_T1_T2_ET3_T4_T5_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc2b305d1  
process::ProcessManager::resume()
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc2b308d7  
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc11e1220  
(unknown)
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc09ffdc5  
start_thread
Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc072cced  __clone
{noformat}
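
To illustrate the suspected failure mode, here is a minimal, self-contained sketch. It is not the Mesos code and not the actual fix: the `Framework` and `Task` structs and the `markTasksUnreachable` function are hypothetical stand-ins. It only shows the general idea of skipping tasks whose framework is not currently registered instead of asserting that the framework pointer is non-NULL.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; these are not the real Mesos types.
struct Framework {
  std::string id;
};

struct Task {
  std::string id;
  std::string frameworkId;
};

// Returns the IDs of the tasks whose frameworks could be notified.
// Tasks whose framework is not currently registered are skipped
// rather than aborting the whole process.
std::vector<std::string> markTasksUnreachable(
    const std::map<std::string, Framework>& registered,
    const std::vector<Task>& tasks)
{
  std::vector<std::string> notified;
  for (const Task& task : tasks) {
    auto it = registered.find(task.frameworkId);
    if (it == registered.end()) {
      // Assumed scenario: the framework has not re-registered yet.
      // A non-NULL assertion at this point would abort instead.
      continue;
    }
    notified.push_back(task.id);
  }
  return notified;
}
```

With one registered framework and one task belonging to an unregistered framework, the function returns only the task of the registered framework and does not abort. Whether silently skipping is the right behaviour for the master is a separate design question that this sketch does not settle.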



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
