Jeff Pollard created MESOS-9555:
-----------------------------------

             Summary: Check failed: reservationScalarQuantities.contains(role)
                 Key: MESOS-9555
                 URL: https://issues.apache.org/jira/browse/MESOS-9555
             Project: Mesos
          Issue Type: Bug
          Components: master
    Affects Versions: 1.5.0
         Environment: * Mesos 1.5
 * {{DISTRIB_ID=Ubuntu}}
 * {{DISTRIB_RELEASE=16.04}}
 * {{DISTRIB_CODENAME=xenial}}
 * {{DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"}}
            Reporter: Jeff Pollard


We recently upgraded our Mesos cluster from version 1.3 to 1.5, and since then 
have been getting periodic master crashes due to this error:
{code:java}
Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: F0205 15:53:57.385118 8434 hierarchical.cpp:2630] Check failed: reservationScalarQuantities.contains(role){code}
Full stack trace is at the end of this issue description. When the master 
fails, we automatically restart it and it rejoins the cluster just fine. I did 
some initial searching and was unable to find any existing bug reports or other 
people experiencing this issue. We run a cluster of 3 masters, and see crashes 
on all 3 instances.

Right before the crash, we saw a {{Removed agent:...}} log line noting that it 
was agent xxx that was removed. Interestingly, though, the same agent ID had 
been "removed" twice before that, each time with a different IP address.
{code:java}
241742:Feb 5 14:31:07 ip-10-0-16-140 mesos-master[8414]: I0205 14:31:07.178582 8432 master.cpp:9893] Removed agent 9b912afa-1ced-49db-9c85-7bc5a22ef072-S7 at slave(1)@10.0.25.174:5051 (10.0.25.174): the agent unregistered
283024:Feb 5 15:32:56 ip-10-0-16-140 mesos-master[8414]: I0205 15:32:56.047614 8432 master.cpp:9893] Removed agent 9b912afa-1ced-49db-9c85-7bc5a22ef072-S2 at slave(1)@10.0.26.65:5051 (10.0.26.65): the agent unregistered
294929:Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: I0205 15:53:57.384759 8432 master.cpp:9893] Removed agent 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 at slave(1)@10.0.18.78:5051 (10.0.18.78): the agent unregistered{code}
I'm not too familiar with the agent ID -> IP mapping, but I presume it is 1:1, 
so that is unexpected.

I saved the full log from the master, so I'm happy to provide more info from 
it, or anything else about our current environment.

Full stack trace is below.
{code:java}
Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170a7d google::LogMessage::Fail()
Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9172830 google::LogMessage::SendToLog()
Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170663 google::LogMessage::Flush()
Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9173259 google::LogMessageFatal::~LogMessageFatal()
Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8443cbd mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackReservations()
Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8448fcd mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeSlave()
Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90c4f11 process::ProcessBase::consume()
Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90dea4a process::ProcessManager::resume()
Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90e25d6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e6700c80 (unknown)
Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5f136ba start_thread
Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5c4941d (unknown){code}
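For context, the failing frame ({{HierarchicalAllocatorProcess::untrackReservations()}}) hits a {{CHECK}} asserting that the role being untracked is still present in the allocator's reservation bookkeeping. A minimal sketch of that kind of invariant, using names that mirror the log message (this is hypothetical illustration, not the actual Mesos code):
{code:java}
#include <cassert>
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
  // Hypothetical per-role reservation bookkeeping, keyed by role name.
  std::unordered_map<std::string, double> reservationScalarQuantities;
  reservationScalarQuantities["webserver"] = 2.0;  // e.g. 2 cpus reserved

  const std::string role = "webserver";

  // Equivalent of: CHECK(reservationScalarQuantities.contains(role)).
  // If the role's entry was already erased (e.g. by an earlier agent
  // removal), this fires and aborts the process, as in the crash above.
  assert(reservationScalarQuantities.count(role) > 0);

  reservationScalarQuantities.erase(role);
  std::cout << "untracked reservations for role " << role << std::endl;
  return 0;
}{code}
So a plausible reading of the crash is that some sequence of agent removals erases the role's entry once, and a later {{removeSlave}} tries to untrack the same reservations again.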



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)