[jira] [Updated] (MESOS-4831) Master sometimes sends two inverse offers after the agent goes into maintenance.

Anand Mazumdar (JIRA) Tue, 01 Mar 2016 16:30:33 -0800

     [ https://issues.apache.org/jira/browse/MESOS-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Anand Mazumdar updated MESOS-4831:
----------------------------------
    Description: 
Showed up on ASF CI for {{MasterMaintenanceTest.PendingUnavailabilityTest}}

https://builds.apache.org/job/Mesos/1748/COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)/consoleFull

{code}
I0229 11:08:57.027559   668 hierarchical.cpp:1437] No resources available to allocate!
I0229 11:08:57.027745   668 hierarchical.cpp:1150] Performed allocation for slave fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-S0 in 272747ns
I0229 11:08:57.027757   675 master.cpp:5369] Sending 1 offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
I0229 11:08:57.028586   675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
I0229 11:08:57.029039   675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
{code}
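
For triaging similar CI logs, the duplicate can be spotted mechanically. The following is a minimal sketch, not part of Mesos: {{countInverseOfferSends}} is a hypothetical helper, and it assumes each log entry occupies a single line and follows the "Sending N inverse offers to framework <id>" wording seen in the excerpt above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper (not a Mesos API): counts, per framework ID, how
// many "Sending N inverse offers to framework <id>" entries a master
// log contains. Assumes one log entry per line.
std::map<std::string, std::size_t> countInverseOfferSends(
    const std::string& log)
{
  // The message wording is copied from the log excerpt in this issue.
  static const std::regex pattern(
      R"(Sending \d+ inverse offers to framework (\S+))");

  std::map<std::string, std::size_t> counts;
  std::istringstream in(log);
  std::string line;

  while (std::getline(in, line)) {
    std::smatch match;
    if (std::regex_search(line, match, pattern)) {
      // match[1] is the framework ID captured by (\S+).
      ++counts[match[1].str()];
    }
  }

  return counts;
}
```

A count greater than one for a framework within a single allocation cycle would correspond to the symptom reported here. Regular "Sending 1 offers" entries are not counted because they lack the word "inverse".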

The expected workflow for this test is roughly:

- The framework receives offers from master.
- The framework updates its maintenance schedule.
- The current offer is rescinded.
- A new offer is received from the master with unavailability set.
- After the agent goes for maintenance, an inverse offer is sent.

However, the logs show the master sending two inverse offers, for reasons 
not yet understood. The test still passes because it only checks that the 
initial inverse offer is present. The issue can also be reproduced with a 
modified version of the original test:

{code}
// Test ensures that an offer will have an `unavailability` set if the
// slave is scheduled to go down for maintenance.
TEST_F(MasterMaintenanceTest, PendingUnavailabilityTest)
{
  Try<PID<Master>> master = StartMaster();
  ASSERT_SOME(master);

  MockExecutor exec(DEFAULT_EXECUTOR_ID);

  Try<PID<Slave>> slave = StartSlave(&exec);
  ASSERT_SOME(slave);

  auto scheduler = std::make_shared<MockV1HTTPScheduler>();

  EXPECT_CALL(*scheduler, heartbeat(_))
    .WillRepeatedly(Return()); // Ignore heartbeats.

  Future<Nothing> connected;
  EXPECT_CALL(*scheduler, connected(_))
    .WillOnce(FutureSatisfy(&connected))
    .WillRepeatedly(Return()); // Ignore future invocations.

  scheduler::TestV1Mesos mesos(master.get(), ContentType::PROTOBUF, scheduler);

  AWAIT_READY(connected);

  Future<Event::Subscribed> subscribed;
  EXPECT_CALL(*scheduler, subscribed(_, _))
    .WillOnce(FutureArg<1>(&subscribed));

  Future<Event::Offers> normalOffers;
  Future<Event::Offers> unavailabilityOffers;
  Future<Event::Offers> inverseOffers;
  EXPECT_CALL(*scheduler, offers(_, _))
    .WillOnce(FutureArg<1>(&normalOffers))
    .WillOnce(FutureArg<1>(&unavailabilityOffers))
    .WillOnce(FutureArg<1>(&inverseOffers));

  // The original offers should be rescinded when the unavailability is changed.
  Future<Nothing> offerRescinded;
  EXPECT_CALL(*scheduler, rescind(_, _))
    .WillOnce(FutureSatisfy(&offerRescinded));

  {
    Call call;
    call.set_type(Call::SUBSCRIBE);

    Call::Subscribe* subscribe = call.mutable_subscribe();
    subscribe->mutable_framework_info()->CopyFrom(DEFAULT_V1_FRAMEWORK_INFO);

    mesos.send(call);
  }

  AWAIT_READY(subscribed);

  v1::FrameworkID frameworkId(subscribed->framework_id());

  AWAIT_READY(normalOffers);
  EXPECT_NE(0, normalOffers->offers().size());

  // Regular offers shouldn't have unavailability.
  foreach (const v1::Offer& offer, normalOffers->offers()) {
    EXPECT_FALSE(offer.has_unavailability());
  }

  // Schedule this slave for maintenance.
  MachineID machine;
  machine.set_hostname(maintenanceHostname);
  machine.set_ip(stringify(slave.get().address.ip));

  const Time start = Clock::now() + Seconds(60);
  const Duration duration = Seconds(120);
  const Unavailability unavailability = createUnavailability(start, duration);

  // Post a valid schedule with one machine.
  maintenance::Schedule schedule = createSchedule(
      {createWindow({machine}, unavailability)});

  // We have a few seconds between the first set of offers and the
  // next allocation of offers. This should be enough time to perform
  // a maintenance schedule update. This update will also trigger the
  // rescinding of offers from the scheduled slave.
  Future<Response> response = process::http::post(
      master.get(),
      "maintenance/schedule",
      headers,
      stringify(JSON::protobuf(schedule)));

  AWAIT_EXPECT_RESPONSE_STATUS_EQ(OK().status, response);

  // The original offers should be rescinded when the unavailability
  // is changed.
  AWAIT_READY(offerRescinded);

  AWAIT_READY(unavailabilityOffers);
  EXPECT_NE(0, unavailabilityOffers->offers().size());

  // Make sure the new offers have the unavailability set.
  foreach (const v1::Offer& offer, unavailabilityOffers->offers()) {
    EXPECT_TRUE(offer.has_unavailability());
    EXPECT_EQ(
        unavailability.start().nanoseconds(),
        offer.unavailability().start().nanoseconds());

    EXPECT_EQ(
        unavailability.duration().nanoseconds(),
        offer.unavailability().duration().nanoseconds());
  }

  // We also expect an inverse offer for the slave to go under
  // maintenance.
  AWAIT_READY(inverseOffers);
  EXPECT_NE(0, inverseOffers->inverse_offers().size());

  EXPECT_CALL(exec, shutdown(_))
    .Times(AtMost(1));

  EXPECT_CALL(*scheduler, disconnected(_))
    .Times(AtMost(1));

  Shutdown(); // Must shutdown before 'containerizer' gets deallocated.
}
{code}
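
To illustrate why a check that only looks for the first inverse offer can pass despite the duplicate, here is a standalone sketch. It does not use Mesos or gmock; the {{Event}} labels and both helper functions are hypothetical and only mirror the workflow steps listed earlier.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical event labels for this sketch; they mirror the workflow
// steps in the description and are not Mesos types.
enum class Event
{
  OFFERS,
  RESCIND,
  UNAVAILABILITY_OFFERS,
  INVERSE_OFFERS
};

// Lenient check: succeeds once the expected events have been observed
// in order, ignoring anything that arrives afterwards. This resembles
// a test that stops looking after the first inverse offer.
bool startsWith(
    const std::vector<Event>& observed,
    const std::vector<Event>& expected)
{
  if (observed.size() < expected.size()) {
    return false;
  }
  return std::equal(expected.begin(), expected.end(), observed.begin());
}

// Strict check: succeeds only if the observed events are exactly the
// expected ones, so a trailing duplicate is reported.
bool matchesExactly(
    const std::vector<Event>& observed,
    const std::vector<Event>& expected)
{
  return observed == expected;
}
```

With an observed sequence that ends in two {{INVERSE_OFFERS}} events, {{startsWith}} returns true while {{matchesExactly}} returns false, which is the difference between the original check and one that would catch this bug.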

Separately, we need to clean up this test so that it does not expect multiple 
offers, i.e., remove the {{numberOfOffers}} constant.

  was:
Showed up on ASF CI for {{MasterMaintenanceTest.PendingUnavailabilityTest}}

https://builds.apache.org/job/Mesos/1748/COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)/consoleFull

{code}
I0229 11:08:57.027559   668 hierarchical.cpp:1437] No resources available to allocate!
I0229 11:08:57.027745   668 hierarchical.cpp:1150] Performed allocation for slave fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-S0 in 272747ns
I0229 11:08:57.027757   675 master.cpp:5369] Sending 1 offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
I0229 11:08:57.028586   675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
I0229 11:08:57.029039   675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
{code}

The expected workflow for this test is roughly:

- The framework receives offers from master.
- The framework updates its maintenance schedule.
- The current offer is rescinded.
- A new offer is received from the master with unavailability set.
- After the agent goes for maintenance, an inverse offer is sent.

However, the logs show the master sending two inverse offers, for reasons 
not yet understood. The test still passes because it only checks that the 
initial inverse offer is present.

Separately, we need to clean up this test so that it does not expect multiple 
offers, i.e., remove the {{numberOfOffers}} constant.


> Master sometimes sends two inverse offers after the agent goes into 
> maintenance.
> --------------------------------------------------------------------------------
>
>                 Key: MESOS-4831
>                 URL: https://issues.apache.org/jira/browse/MESOS-4831
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.27.0
>            Reporter: Anand Mazumdar
>              Labels: maintenance, mesosphere



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
