[
https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492785#comment-16492785
]
Benno Evers commented on MESOS-7966:
------------------------------------
I tried to reproduce this with a custom Mesos 1.2 build (compiled from
de306b5786de3c221bae1457c6f2ccaeb38eef9f), modifying the provided call.py
script to change the hostnames and move the timestamp into the future, and
then running it via
{noformat}
while :; do
  python call.py
done
{noformat}
for a few minutes, but could not reproduce the master crash.
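For reference, a minimal sketch of a script along these lines is below, assuming
call.py posts an UPDATE_MAINTENANCE_SCHEDULE call to the master's v1 operator
API; the master endpoint, agent hostname, and window sizes are placeholders
rather than the values from the original script:
{noformat}
#!/usr/bin/env python3
# Sketch of a maintenance-schedule update, not the original call.py.
# The endpoint and hostname below are placeholders.

import json
import time
import urllib.request

MASTER_API = "http://mesos-master.example.com:5050/api/v1"  # placeholder
AGENT_HOSTNAME = "agent1.example.com"                        # placeholder


def nanoseconds_from_now(seconds):
    """Absolute time `seconds` from now, in nanoseconds since the epoch."""
    return int((time.time() + seconds) * 1e9)


call = {
    "type": "UPDATE_MAINTENANCE_SCHEDULE",
    "update_maintenance_schedule": {
        "schedule": {
            "windows": [
                {
                    "machine_ids": [{"hostname": AGENT_HOSTNAME}],
                    "unavailability": {
                        # Start one minute in the future, last one hour.
                        "start": {"nanoseconds": nanoseconds_from_now(60)},
                        "duration": {"nanoseconds": int(3600 * 1e9)},
                    },
                }
            ]
        }
    },
}

request = urllib.request.Request(
    MASTER_API,
    data=json.dumps(call).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    print(response.status, response.read().decode())
{noformat}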
Looking at the code, I don't see any obvious race. The
`Master::updateUnavailability()` handler in the master dispatches deletions for
all existing inverse offers to the allocator actor, removes the offers from its
own internal data structures, and afterwards dispatches a deletion for the
maintenance to the allocator actor.
The assertion triggers because the allocator receives a request to update an
inverse offer while the corresponding maintenance does not exist yet (or no
longer exists), but I haven't found a code path that could lead to this.
If you could update your filtered log to include the log lines generated by the
following block in master.cpp, I think this would help to pin down the exact
sequence of deletions/additions that triggers the crash:
{noformat}
if (unavailability.isSome()) {
  // TODO(jmlvanre): Add stream operator for unavailability.
  LOG(INFO) << "Updating unavailability of agent " << *slave
            << ", starting at "
            << Nanoseconds(unavailability.get().start().nanoseconds());
} else {
  LOG(INFO) << "Removing unavailability of agent " << *slave;
}
{noformat}
> check for maintenance on agent causes fatal error
> -------------------------------------------------
>
> Key: MESOS-7966
> URL: https://issues.apache.org/jira/browse/MESOS-7966
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.1.0
> Reporter: Rob Johnson
> Assignee: Joseph Wu
> Priority: Critical
> Labels: mesosphere, reliability
>
> We interact with the maintenance API frequently to orchestrate gracefully
> draining agents of tasks without impacting service availability.
> Occasionally we seem to trigger a fatal error in Mesos when interacting with
> the api. This happens relatively frequently, and impacts us when downstream
> frameworks (marathon) react badly to leader elections.
> Here is the log line that we see when the master dies:
> {code}
> F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed:
> slaves[slaveId].maintenance.isSome()
> {code}
> It's quite possible we're using the maintenance API in the wrong way. We're
> happy to provide any other logs you need - please let me know what would be
> useful for debugging.
> Thanks.