[jira] [Commented] (MESOS-10146) Removing task from slave when framework is disconnected causes master to crash

Charles Natali (Jira) Mon, 25 Jan 2021 16:21:07 -0800


    [ 
https://issues.apache.org/jira/browse/MESOS-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271739#comment-17271739
 ]


Charles Natali commented on MESOS-10146:
----------------------------------------

Looking at the 1.9.0 code, I think I found what caused this. However, looking 
at master, I believe it has already been fixed by this commit: 
[https://github.com/apache/mesos/commit/6be17200b8084ad3524e7d450c411765b3214c0f]

which was made for this issue: 
[https://issues.apache.org/jira/projects/MESOS/issues/MESOS-9609|https://issues.apache.org/jira/projects/MESOS/issues/MESOS-9609?filter=allissues]

So I think this can be closed as a duplicate of MESOS-9609.
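
For illustration only, the failure mode in the log below (a fatal check on a 
framework pointer that turns out to be null while an agent's tasks are being 
removed) can be sketched with a stand-alone model. Everything here is a 
hypothetical stand-in: `Master`, `Framework`, `getFramework` and 
`removeAgentTasks` are simplified names, not the real Mesos classes, and the 
guard shown is not a reproduction of the commit linked above.

```cpp
#include <cassert>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for the master's bookkeeping; NOT the Mesos types.
struct Framework {
  std::vector<std::string> removedTasks;
};

struct Master {
  // Frameworks currently known to the master, keyed by framework ID.
  std::map<std::string, Framework> frameworks;

  // Returns nullptr when the framework ID is unknown to the master.
  Framework* getFramework(const std::string& id) {
    auto it = frameworks.find(id);
    return it == frameworks.end() ? nullptr : &it->second;
  }

  // Removes an agent's tasks; agentTasks maps framework ID -> task IDs.
  // Returns how many framework entries were skipped because the
  // framework could not be found.
  int removeAgentTasks(
      const std::map<std::string, std::vector<std::string>>& agentTasks) {
    int skipped = 0;
    for (const auto& entry : agentTasks) {
      Framework* framework = getFramework(entry.first);
      if (framework == nullptr) {
        // A fatal check at this point would abort the whole process.
        // This sketch logs and skips the entry instead.
        std::cerr << "Framework " << entry.first
                  << " not found; skipping its tasks" << std::endl;
        ++skipped;
        continue;
      }
      for (const auto& taskId : entry.second) {
        framework->removedTasks.push_back(taskId);
      }
    }
    return skipped;
  }
};
```

Whether skipping is the right behaviour depends on why the framework is 
missing; the sketch only shows that the lookup result needs to be tested 
before it is dereferenced.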


> Removing task from slave when framework is disconnected causes master to crash
> ------------------------------------------------------------------------------
>
>                 Key: MESOS-10146
>                 URL: https://issues.apache.org/jira/browse/MESOS-10146
>             Project: Mesos
>          Issue Type: Bug
>          Components: c++ api, framework
>    Affects Versions: 1.9.0
>         Environment: Mesos master with three master nodes
>            Reporter: Naveen
>            Priority: Blocker
>
> Hello, 
>     we want to report an issue we observed when removing tasks from a slave. 
> There is a check for a valid framework before tasks can be removed. 
> A framework can be disconnected for several reasons; when it is, this check 
> fails and crashes the Mesos master node. 
> [https://github.com/apache/mesos/blob/1.9.0/src/master/master.cpp#L11842]
> There is also an unguarded access to the internal framework state on line 11853.
> Error logs - 
> {noformat}
> mesos-master[5483]: I0618 14:05:20.859189 5491 master.cpp:9512] Marked agent 
> 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 (10.160.73.79) unreachable: health 
> check timed out
> mesos-master[5483]: F0618 14:05:20.859347 5491 master.cpp:11842] Check 
> failed: framework != nullptr Framework 
> 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067 not found while removing agent 
> 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 at slave(1)@10.160.73.79:5051 
> (10.160.73.79); agent tasks: { 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067: { } 
> }
> mesos-master[5483]: *** Check failure stack trace: ***
> mesos-master[5483]: I0618 14:05:20.859781 5490 hierarchical.cpp:1013] Removed 
> all filters for agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303
> mesos-master[5483]: I0618 14:05:20.872217 5490 hierarchical.cpp:890] Removed 
> agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303
> mesos-master[5483]: I0618 14:05:20.859922 5487 replica.cpp:695] Replica 
> received learned notice for position 42070 from 
> log-network(1)@10.160.73.212:5050
> mesos-master[5483]: @ 0x7f2fdf6a5b1d google::LogMessage::Fail()
> mesos-master[5483]: @ 0x7f2fdf6a7dfd google::LogMessage::SendToLog()
> mesos-master[5483]: @ 0x7f2fdf6a56ab google::LogMessage::Flush()
> mesos-master[5483]: @ 0x7f2fdf6a8859 
> google::LogMessageFatal::~LogMessageFatal()
> mesos-master[5483]: @ 0x7f2fde2677f2 
> mesos::internal::master::Master::__removeSlave()
> mesos-master[5483]: @ 0x7f2fde267ebe 
> mesos::internal::master::Master::_markUnreachable()
> mesos-master[5483]: @ 0x7f2fde268215 
> _ZNO6lambda12CallableOnceIFN7process6FutureIbEEvEE10CallableFnINS_8internal7PartialIZN5mesos8internal6master6Master15markUnreachableERKNS9_9SlaveInfoEbRKSsEUlbE_JbEEEEclEv
> mesos-master[5483]: @ 0x7f2fddf30688 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vEEEEESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEEEEEclEOS3_
> mesos-master[5483]: @ 0x7f2fdf5e3b91 process::ProcessBase::consume()
> mesos-master[5483]: @ 0x7f2fdf608f77 process::ProcessManager::resume()
> mesos-master[5483]: @ 0x7f2fdf60cb36 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> mesos-master[5483]: @ 0x7f2fdf8c34d0 execute_native_thread_routine
> mesos-master[5483]: @ 0x7f2fdba02ea5 start_thread
> mesos-master[5483]: @ 0x7f2fdb20e8dd __clone
> systemd[1]: mesos-master.service: main process exited, code=killed, 
> status=6/ABRT
> systemd[1]: Unit mesos-master.service entered failed state.
> systemd[1]: mesos-master.service failed.
> systemd[1]: mesos-master.service holdoff time over, scheduling restart.
> systemd[1]: Stopped Mesos Master.
> systemd[1]: Started Mesos Master.
> mesos-master[28757]: I0618 14:05:41.461403 28748 logging.cpp:201] INFO level 
> logging started!
> mesos-master[28757]: I0618 14:05:41.461712 28748 main.cpp:243] Build: 
> 2020-05-09 10:42:00 by centos
> mesos-master[28757]: I0618 14:05:41.461721 28748 main.cpp:244] Version: 1.9.0
> mesos-master[28757]: I0618 14:05:41.461726 28748 main.cpp:247] Git tag: 1.9.0
> mesos-master[28757]: I0618 14:05:41.461730 28748 main.cpp:251] Git SHA: 
> 5e79a584e6ec3e9e2f96e8bf418411df9dafac2e{noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
