> On Jan. 17, 2017, 5:07 p.m., Jay Guo wrote: > > src/master/master.cpp, lines 7201-7203 > > <https://reviews.apache.org/r/55571/diff/1/?file=1605757#file1605757line7201> > > > > There may be some corner cases that causes this `CHECK` to coredump. > > For example: > > 1. Start a Mesos cluster with master, agent and a scheduler with single > > role `role1`. > > 2. Master failover > > 3. Agent re-register before scheduler does > > 4. Scheduler re-subscribe with multi-role `{role2}` > > > > In this case, master re-construct FrameworkInfo from re-registered > > agent, which contains old role. However, in current implementation, `Error` > > of `updateFrameworkInfo` is not caught for recovered frameworks. > > > > In theory, this test case shouldn't cause master coredump: > > ```cpp > > master::Flags masterFlags = CreateMasterFlags(); > > Try<Owned<cluster::Master>> master = StartMaster(masterFlags); > > ASSERT_SOME(master); > > > > MockExecutor exec(DEFAULT_EXECUTOR_ID); > > TestContainerizer containerizer(&exec); > > > > StandaloneMasterDetector slaveDetector(master.get()->pid); > > Try<Owned<cluster::Slave>> slave = StartSlave(&slaveDetector, > > &containerizer); > > ASSERT_SOME(slave); > > > > FrameworkInfo frameworkInfo = DEFAULT_FRAMEWORK_INFO; > > frameworkInfo.set_failover_timeout(1024); > > frameworkInfo.set_role("role1"); > > > > Future<FrameworkID> frameworkId; > > > > { > > MockScheduler sched; > > MesosSchedulerDriver driver( > > &sched, frameworkInfo, master.get()->pid, DEFAULT_CREDENTIAL); > > > > EXPECT_CALL(sched, registered(&driver, _, _)) > > .WillOnce(FutureArg<1>(&frameworkId)); > > > > Future<vector<Offer>> offers; > > EXPECT_CALL(sched, resourceOffers(&driver, _)) > > .WillOnce(FutureArg<1>(&offers)) > > .WillRepeatedly(Return()); // Ignore subsequent offers. > > > > driver.start(); > > > > Clock::pause(); > > Clock::settle(); > > Clock::resume(); > > > > AWAIT_READY(frameworkId); > > > > AWAIT_READY(offers); > > EXPECT_FALSE(offers->empty()); > > > > TaskInfo task; > > task.set_name(""); > > task.mutable_task_id()->set_value("1"); > > task.mutable_slave_id()->MergeFrom(offers.get()[0].slave_id()); > > task.mutable_resources()->MergeFrom(offers.get()[0].resources()); > > task.mutable_executor()->MergeFrom(DEFAULT_EXECUTOR_INFO); > > > > EXPECT_CALL(exec, registered(_, _, _, _)) > > .Times(1); > > > > EXPECT_CALL(exec, launchTask(_, _)) > > .WillOnce(SendStatusUpdateFromTask(TASK_RUNNING)); > > > > Future<TaskStatus> runningStatus; > > EXPECT_CALL(sched, statusUpdate(&driver, _)) > > .WillOnce(FutureArg<1>(&runningStatus)); > > > > driver.launchTasks(offers.get()[0].id(), {task}); > > > > AWAIT_READY(runningStatus); > > EXPECT_EQ(TASK_RUNNING, runningStatus.get().state()); > > EXPECT_EQ(task.task_id(), runningStatus.get().task_id()); > > } > > > > slaveDetector.appoint(None()); > > > > master->reset(); > > master = StartMaster(masterFlags); > > ASSERT_SOME(master); > > > > Future<SlaveReregisteredMessage> slaveReregisteredMessage = > > FUTURE_PROTOBUF(SlaveReregisteredMessage(), _, _); > > > > slaveDetector.appoint(master.get()->pid); > > > > AWAIT_READY(slaveReregisteredMessage); > > > > frameworkInfo.mutable_id()->CopyFrom(frameworkId.get()); > > frameworkInfo.add_roles("role2"); > > frameworkInfo.clear_role(); > > frameworkInfo.add_capabilities()->set_type( > > FrameworkInfo::Capability::MULTI_ROLE); > > > > MockScheduler sched; > > MesosSchedulerDriver driver( > > &sched, frameworkInfo, master.get()->pid, DEFAULT_CREDENTIAL); > > > > Future<Nothing> registered; > > EXPECT_CALL(sched, registered(&driver, _, _)) > > .WillOnce(FutureSatisfy(®istered)); > > > > driver.start(); > > > > Clock::pause(); > > Clock::settle(); > > Clock::resume(); > > > > AWAIT_READY(registered); > > > > EXPECT_CALL(exec, shutdown(_)) > > .Times(AtMost(1)); > > > > driver.stop(); > > driver.join(); > > ``` > > Benjamin Bannier wrote: > Thanks for spotting this, this is of course unacceptable. Also your test > case was helpful. I updated the patch to allow > `Master::activateRecoveredFramework` to fail.
Maybe you could add this test case as part of our test suite as well? - Jay ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/55571/#review161827 ----------------------------------------------------------- On Jan. 18, 2017, 1:13 a.m., Benjamin Bannier wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/55571/ > ----------------------------------------------------------- > > (Updated Jan. 18, 2017, 1:13 a.m.) > > > Review request for mesos, Benjamin Mahler, Jay Guo, and Guangya Liu. > > > Bugs: MESOS-6631 > https://issues.apache.org/jira/browse/MESOS-6631 > > > Repository: mesos > > > Description > ------- > > This allows us to confirm that an updated FrameworkInfo is fully > compatible with an already known one. By performing this check in > updateFrameworkInfo when can be sure that both the new and old > FrameworkInfo are available (e.g., after framework reregistration, or > master or framework failover). > > > Diffs > ----- > > src/master/master.hpp 44f4fecb1fbe8bebf830990a59a5462338e6e004 > src/master/master.cpp b863ff6e93931c3d1ee056248084c7f44caf2fd9 > > Diff: https://reviews.apache.org/r/55571/diff/ > > > Testing > ------- > > Tested as part of https://reviews.apache.org/r/55271/. > > > Thanks, > > Benjamin Bannier > >
