[ 
https://issues.apache.org/jira/browse/MESOS-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375270#comment-16375270
 ] 

Greg Mann commented on MESOS-8601:
----------------------------------

{code}
commit b4e210678c04e57c2fa9f277b44f6d011da1846a (HEAD -> master, origin/master, origin/HEAD, merge)
Author: Chun-Hung Hsiao <chhs...@mesosphere.io>
Date:   Fri Feb 23 18:37:17 2018 -0800

    Added a master API test for agent re-registration after master failover.

    This test verifies that subscribing to the 'api/v1' endpoint between a
    master failover and an agent re-registration won't cause the master to
    crash.

    Review: https://reviews.apache.org/r/65775/
{code}
{code}
commit f2ec2b288e823424b2efe71d62ef90101b7a863f
Author: Chun-Hung Hsiao <chhs...@mesosphere.io>
Date:   Fri Feb 23 18:37:12 2018 -0800

    Fixed a master API bug for agent re-registration after master failover.

    When the master fails over and a client subscribes to the master before
    agent re-registration, the master will crash when sending `TASK_ADDED`
    because the framework info might not have been added to the master yet.
    This patch fixes this bug.

    Review: https://reviews.apache.org/r/65774/
{code}

> Master crashes during slave reregistration after failover.
> ----------------------------------------------------------
>
>                 Key: MESOS-8601
>                 URL: https://issues.apache.org/jira/browse/MESOS-8601
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.5.0
>            Reporter: Chun-Hung Hsiao
>            Assignee: Chun-Hung Hsiao
>            Priority: Blocker
>              Labels: master
>
> The following happened after a master failover. During slave
> reregistration, the new leading master added tasks and notified all of its
> subscribers, which triggered the following check failure:
> {noformat}
> F0222 15:53:44.440387  2805 master.cpp:11190] Check failed: 'framework' Must be non NULL
> *** Check failure stack trace: ***
> @     0x7f1357be521d  google::LogMessage::Fail()
> @     0x7f1357be704d  google::LogMessage::SendToLog()
> @     0x7f1357be4e0c  google::LogMessage::Flush()
> @     0x7f1357be7949  google::LogMessageFatal::~LogMessageFatal()
> @     0x7f1356c80e2d  google::CheckNotNull<>()
> @     0x7f1356ce2666  mesos::internal::master::Master::Subscribers::send()
> @     0x7f1356cece83  mesos::internal::master::Slave::addTask()
> @     0x7f1356cf3206  mesos::internal::master::Slave::Slave()
> @     0x7f1356cf5b90  mesos::internal::master::Master::__reregisterSlave()
> @     0x7f1356d02cf8  mesos::internal::master::Master::_reregisterSlave()
> @     0x7f1357b43761  process::ProcessBase::consume()
> @     0x7f1357b5248c  process::ProcessManager::resume()
> @     0x7f1357b579f6  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> @     0x7f1354e6c230  (unknown)
> @     0x7f135468ae25  start_thread
> @     0x7f13543b834d  __clone
> {noformat}
> The check failed because the master looks up the framework info when
> sending the notification:
> https://github.com/apache/mesos/blob/1.5.x/src/master/master.cpp#L11190
> but it only adds the framework afterwards:
> https://github.com/apache/mesos/blob/1.5.x/src/master/master.cpp#L6963
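
The ordering problem described above can be illustrated with a small standalone sketch. This is not the Mesos code and not the patch from r/65774; `ToyMaster`, `ToyTask`, `sendTaskAdded`, and both `reregister*` functions are hypothetical names made up for illustration. The sketch only assumes what the report states: building a `TASK_ADDED` notification needs the task's framework, and during reregistration the framework was added after the notification was sent. Where the real master aborts on a failed check, the toy version returns `false` so the difference between the two orderings can be observed.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a task reported by a reregistering agent.
struct ToyTask {
  std::string id;
  std::string frameworkId;
  std::string frameworkName;
};

// Hypothetical stand-in for the master's state; not a Mesos type.
struct ToyMaster {
  std::map<std::string, std::string> frameworks;  // framework id -> name
  std::vector<std::string> events;                // events sent to subscribers

  // Builds a TASK_ADDED event, which requires the framework to be known.
  // Returns false when the lookup fails (the real master hit a fatal check).
  bool sendTaskAdded(const ToyTask& task) {
    auto it = frameworks.find(task.frameworkId);
    if (it == frameworks.end()) {
      return false;
    }
    events.push_back("TASK_ADDED " + task.id + " (" + it->second + ")");
    return true;
  }
};

// Ordering from the report: notify first, add the framework afterwards.
// Returns the number of notifications that could not be built.
int reregisterNotifyFirst(ToyMaster& master, const std::vector<ToyTask>& tasks) {
  int failed = 0;
  for (const ToyTask& task : tasks) {
    if (!master.sendTaskAdded(task)) {
      ++failed;
    }
  }
  for (const ToyTask& task : tasks) {
    master.frameworks[task.frameworkId] = task.frameworkName;
  }
  return failed;
}

// One possible safe ordering: add frameworks first, then notify.
int reregisterFrameworksFirst(ToyMaster& master, const std::vector<ToyTask>& tasks) {
  for (const ToyTask& task : tasks) {
    master.frameworks[task.frameworkId] = task.frameworkName;
  }
  int failed = 0;
  for (const ToyTask& task : tasks) {
    if (!master.sendTaskAdded(task)) {
      ++failed;
    }
  }
  return failed;
}
```

With a freshly failed-over `ToyMaster` (empty `frameworks`), `reregisterNotifyFirst` fails to build every notification, while `reregisterFrameworksFirst` sends one event per task. Another option would be to keep the original order and have the notification path tolerate a missing framework; which approach the actual fix takes is described in the review linked in the commit message.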



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
