[ https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766593#comment-16766593 ]
Jeff Pollard commented on MESOS-9555:
-------------------------------------

I think we should be okay to run with the current public release for a bit. We've been running this Mesos version for a while, and the crashes have not affected our production environment so far. Do you have an estimate of when the next public release with this fix will be out?

> Check failed: reservationScalarQuantities.contains(role)
> --------------------------------------------------------
>
>                 Key: MESOS-9555
>                 URL: https://issues.apache.org/jira/browse/MESOS-9555
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>    Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.7.0, 1.7.1
>         Environment: * Mesos 1.5
> * {{DISTRIB_ID=Ubuntu}}
> * {{DISTRIB_RELEASE=16.04}}
> * {{DISTRIB_CODENAME=xenial}}
> * {{DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"}}
>            Reporter: Jeff Pollard
>            Assignee: Benjamin Mahler
>            Priority: Critical
>         Attachments: 0001-Added-additional-logging-to-1.5.2-to-investigate-MES.patch, 0001-Fixed-an-allocator-crash-during-reservation-tracking.patch, mesos.log, mesos_leader.log
>
> We recently upgraded our Mesos cluster from version 1.3 to 1.5, and since then we have been getting periodic master crashes due to this error:
> {code:java}
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: F0205 15:53:57.385118 8434 hierarchical.cpp:2630] Check failed: reservationScalarQuantities.contains(role){code}
> The full stack trace is at the end of this issue description. When the master fails, we automatically restart it and it rejoins the cluster just fine. I did some initial searching and was unable to find any existing bug reports or other people experiencing this issue. We run a cluster of 3 masters and see crashes on all 3 instances.
> Right before the crash, we saw a {{Removed agent:...}} log line noting that it was agent 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 that was removed:
> {code:java}
> 294929:Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: I0205 15:53:57.384759 8432 master.cpp:9893] Removed agent 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 at slave(1)@10.0.18.78:5051 (10.0.18.78): the agent unregistered{code}
> I saved the full log from the master, so I'm happy to provide more info from it, or anything else about our current environment.
> The full stack trace is below.
> {code:java}
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170a7d google::LogMessage::Fail()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9172830 google::LogMessage::SendToLog()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170663 google::LogMessage::Flush()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9173259 google::LogMessageFatal::~LogMessageFatal()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8443cbd mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackReservations()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8448fcd mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeSlave()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90c4f11 process::ProcessBase::consume()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90dea4a process::ProcessManager::resume()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90e25d6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e6700c80 (unknown)
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5f136ba start_thread
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5c4941d (unknown)
> {code}
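For readers trying to make sense of the failing CHECK: below is a minimal sketch, not the actual Mesos code, of the kind of per-role bookkeeping the stack trace suggests {{untrackReservations()}} performs when {{removeSlave()}} runs. Only the map name {{reservationScalarQuantities}} and the checked condition come from the log above; all other names, types, and helpers are illustrative assumptions.

{code:c++}
// Simplified sketch of the allocator's reservation bookkeeping.
// NOT the actual Mesos implementation; names other than
// `reservationScalarQuantities` are hypothetical.
#include <cassert>
#include <iostream>
#include <map>
#include <string>

// Per-role running totals of reserved scalar resources (e.g. cpus, mem).
std::map<std::string, std::map<std::string, double>> reservationScalarQuantities;

// Hypothetical: called when an agent with reservations is added; adds its
// quantities to the per-role totals.
void trackReservations(const std::string& role,
                       const std::map<std::string, double>& quantities) {
  for (const auto& [name, value] : quantities) {
    reservationScalarQuantities[role][name] += value;
  }
}

// Hypothetical: called when an agent is removed; subtracts its quantities.
// The CHECK in hierarchical.cpp asserts that the role is still tracked. If
// track/untrack calls ever get out of sync, this fails and the master aborts,
// which is the crash reported in this issue.
void untrackReservations(const std::string& role,
                         const std::map<std::string, double>& quantities) {
  // Stand-in for: CHECK(reservationScalarQuantities.contains(role));
  assert(reservationScalarQuantities.count(role) > 0);

  auto& totals = reservationScalarQuantities[role];
  for (const auto& [name, value] : quantities) {
    totals[name] -= value;
    if (totals[name] <= 0.0) {
      totals.erase(name);
    }
  }
  if (totals.empty()) {
    reservationScalarQuantities.erase(role);
  }
}

int main() {
  const std::map<std::string, double> agentReservation =
      {{"cpus", 2.0}, {"mem", 4096.0}};

  trackReservations("webapp", agentReservation);
  untrackReservations("webapp", agentReservation);  // Fine: role is tracked.

  // Untracking the same agent's reservations a second time (one way the
  // bookkeeping could get out of sync during agent removal) would trip the
  // assertion, just as the CHECK does in the master:
  // untrackReservations("webapp", agentReservation);  // assert fails

  std::cout << "bookkeeping consistent" << std::endl;
  return 0;
}
{code}

Under this model, the CHECK fires whenever untracking is attempted for a role that has no tracked reservations, for example if an agent's reservations are untracked twice during removal. The title of the attached patch ("Fixed an allocator crash during reservation tracking") points at a bookkeeping mismatch of roughly this kind, though the exact mechanism is only confirmed by the patch itself.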