[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492785#comment-16492785 ]

Benno Evers commented on MESOS-7966:
------------------------------------

I tried to reproduce this with a custom Mesos 1.2 build (compiled from 
de306b5786de3c221bae1457c6f2ccaeb38eef9f), modifying the provided call.py 
script to change the hostnames and move the timestamp into the future, and 
then running it via
{noformat}
while :; do
  python call.py
done
{noformat}
for a few minutes, but could not trigger a master crash.
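
The attached call.py is not reproduced here; as a rough sketch of the kind of 
request it presumably makes, the schedule can also be posted with curl against 
the master's /maintenance/schedule endpoint. The master address, hostnames, 
IPs, and times below are placeholders, and the JSON shape follows the 
maintenance primitives documentation rather than the attached script:

{noformat}
# Placeholder master address, hostnames/IPs, and times; adjust to match the
# attached call.py. Schedules a one-hour maintenance window starting roughly
# one hour from now.
START_NS=$(( ($(date +%s) + 3600) * 1000000000 ))
DURATION_NS=$(( 3600 * 1000000000 ))

curl -s -X POST http://master.example.com:5050/maintenance/schedule \
  -H 'Content-Type: application/json' \
  -d "{
        \"windows\": [{
          \"machine_ids\": [
            {\"hostname\": \"agent1.example.com\", \"ip\": \"10.0.0.1\"}
          ],
          \"unavailability\": {
            \"start\": {\"nanoseconds\": ${START_NS}},
            \"duration\": {\"nanoseconds\": ${DURATION_NS}}
          }
        }]
      }"
{noformat}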

Looking at the code, I don't see any obvious race. The 
`Master::updateUnavailability()` handler in the master dispatches deletions for 
all existing inverse offers to the allocator actor, removes the offers from its 
own internal data structures, and afterwards dispatches a deletion for the 
maintenance to the allocator actor.

The assertion triggers because the allocator gets a request to update an 
inverse offer for an agent whose maintenance does not exist yet (or no longer 
exists), but I haven't found a code path that could lead to this.

If you could update your filtered log to include the log lines generated by the 
following block in master.cpp, I think this would help to pin down the exact 
sequence of deletions/additions that triggers the crash:

{noformat}
      if (unavailability.isSome()) {
        // TODO(jmlvanre): Add stream operator for unavailability.
        LOG(INFO) << "Updating unavailability of agent " << *slave
                  << ", starting at "
                  << Nanoseconds(unavailability.get().start().nanoseconds());
      } else {
        LOG(INFO) << "Removing unavailability of agent " << *slave;
      }
{noformat}
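
To pull those lines, together with the allocator lines around the crash, out 
of the master log, something like the following should do (the log path is a 
placeholder for wherever glog writes on your setup):

{noformat}
# Placeholder log path; adjust to wherever the master's glog output lives.
grep -E 'unavailability of agent|maintenance|hierarchical' \
  /var/log/mesos/mesos-master.INFO
{noformat}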

> check for maintenance on agent causes fatal error
> -------------------------------------------------
>
>                 Key: MESOS-7966
>                 URL: https://issues.apache.org/jira/browse/MESOS-7966
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.1.0
>            Reporter: Rob Johnson
>            Assignee: Joseph Wu
>            Priority: Critical
>              Labels: mesosphere, reliability
>
> We interact with the maintenance API frequently to orchestrate gracefully 
> draining agents of tasks without impacting service availability.
> Occasionally we seem to trigger a fatal error in Mesos when interacting with 
> the API. This happens relatively frequently, and impacts us when downstream 
> frameworks (Marathon) react badly to the resulting leader elections.
> Here is the log line that we see when the master dies:
> {code}
> F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: 
> slaves[slaveId].maintenance.isSome()
> {code}
> It's quite possible we're using the maintenance API in the wrong way. We're 
> happy to provide any other logs you need - please let me know what would be 
> useful for debugging.
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
