[ https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766559#comment-16766559 ]
Benjamin Mahler commented on MESOS-9555:
----------------------------------------

[~dtw] thanks, that was enough information to debug:

{noformat}
I0212 20:19:36.469903 20854 hierarchical.cpp:2629] HierarchicalAllocatorProcess::trackReservations(daa9a6aa-ae6b-432f-a86e-a76b62460f27-S1, { marathon-special: ports(reservations: [(STATIC,marathon-special)]):[80-80, 5050-5050, 7777-7777, 8080-8080, 28370-28375] })
BEFORE: { marathon-special: {} }
AFTER: { marathon-special: {} }
--
I0212 20:21:30.977318 20859 hierarchical.cpp:2672] HierarchicalAllocatorProcess::untrackReservations(878dc5a9-d433-40ed-8f62-a24c19927edb-S19, { marathon-special: ports(reservations: [(STATIC,marathon-special)]):[80-80, 5050-5050, 7777-7777, 8080-8080, 28370-28375] })
BEFORE: { marathon-special: {} }
AFTER: { }
--
F0212 20:23:03.244000 20855 hierarchical.cpp:2644] Check failed: reservationScalarQuantities.contains(role) HierarchicalAllocatorProcess::untrackReservations(daa9a6aa-ae6b-432f-a86e-a76b62460f27-S1, { marathon-special: ports(reservations: [(STATIC,marathon-special)]):[80-80, 5050-5050, 7777-7777, 8080-8080, 28370-28375] })
BEFORE: { }
CURRENT: { }
{noformat}

The tracking logic is incorrectly dealing with non-scalar resources. It's not supposed to insert empty entries when tracking, or expect empty entries when untracking an empty amount, but it does both.

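To illustrate the intended behavior described above, here is a minimal sketch of reservation tracking that ignores roles whose reservations contain no scalar resources (for example, ports only). It is not the Mesos code or the actual fix: {{Tracker}}, {{ScalarQuantities}}, {{Reservations}}, {{track}} and {{untrack}} are hypothetical names, and {{untrack}} returns {{false}} where the real allocator has a CHECK.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins, not the Mesos types:
// scalar resource name -> amount, and role -> scalar quantities for that role.
typedef std::map<std::string, double> ScalarQuantities;
typedef std::map<std::string, ScalarQuantities> Reservations;

struct Tracker {
  Reservations tracked;

  // Adds each role's scalar quantities to the running totals.
  void track(const Reservations& reservations) {
    for (const auto& r : reservations) {
      const ScalarQuantities& quantities = r.second;
      // A ports-only reservation has no scalar part: do not create an
      // empty entry for the role.
      if (quantities.empty()) continue;
      ScalarQuantities& total = tracked[r.first];
      for (const auto& q : quantities) total[q.first] += q.second;
    }
  }

  // Subtracts each role's scalar quantities. Returns false where this
  // sketch assumes the real code would fail its CHECK (unknown role or
  // resource name); earlier subtractions are not rolled back.
  bool untrack(const Reservations& reservations) {
    for (const auto& r : reservations) {
      const ScalarQuantities& quantities = r.second;
      // Nothing was tracked for an empty amount, so expect no entry.
      if (quantities.empty()) continue;
      Reservations::iterator entry = tracked.find(r.first);
      if (entry == tracked.end()) return false;
      for (const auto& q : quantities) {
        ScalarQuantities::iterator amount = entry->second.find(q.first);
        if (amount == entry->second.end()) return false;
        amount->second -= q.second;
        if (amount->second <= 0.0) entry->second.erase(amount);
      }
      // Drop the role once nothing scalar remains reserved for it.
      if (entry->second.empty()) tracked.erase(entry);
    }
    return true;
  }
};
```

In this sketch, two agents that each hold a ports-only reservation for {{marathon-special}} can be tracked and then untracked in any order: no empty role entry is ever created, so the second {{untrack}} never looks for an entry that the first one erased.
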
> Check failed: reservationScalarQuantities.contains(role)
> --------------------------------------------------------
>
> Key: MESOS-9555
> URL: https://issues.apache.org/jira/browse/MESOS-9555
> Project: Mesos
> Issue Type: Bug
> Components: allocation, master
> Affects Versions: 1.5.0
> Environment: * Mesos 1.5
> * {{DISTRIB_ID=Ubuntu}}
> * {{DISTRIB_RELEASE=16.04}}
> * {{DISTRIB_CODENAME=xenial}}
> * {{DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"}}
> Reporter: Jeff Pollard
> Priority: Critical
> Attachments:
> 0001-Added-additional-logging-to-1.5.2-to-investigate-MES.patch, mesos.log,
> mesos_leader.log
>
>
> We recently upgraded our Mesos cluster from version 1.3 to 1.5, and since
> then have been getting periodic master crashes due to this error:
> {code:java}
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: F0205 15:53:57.385118 8434
> hierarchical.cpp:2630] Check failed:
> reservationScalarQuantities.contains(role){code}
> Full stack trace is at the end of this issue description. When the master
> fails, we automatically restart it and it rejoins the cluster just fine. I
> did some initial searching and was unable to find any existing bug reports or
> other people experiencing this issue. We run a cluster of 3 masters, and see
> crashes on all 3 instances.
> Right before the crash, we saw a {{Removed agent:...}} log line noting that
> it was agent 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 that was removed.
> {code:java}
> 294929:Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: I0205
> 15:53:57.384759 8432 master.cpp:9893] Removed agent
> 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 at slave(1)@10.0.18.78:5051
> (10.0.18.78): the agent unregistered{code}
> I saved the full log from the master, so happy to provide more info from it,
> or anything else about our current environment.
> Full stack trace is below.
> {code:java}
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170a7d
> google::LogMessage::Fail()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9172830
> google::LogMessage::SendToLog()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170663
> google::LogMessage::Flush()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9173259
> google::LogMessageFatal::~LogMessageFatal()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8443cbd
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackReservations()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8448fcd
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeSlave()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90c4f11
> process::ProcessBase::consume()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90dea4a
> process::ProcessManager::resume()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90e25d6
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e6700c80 (unknown)
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5f136ba
> start_thread
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5c4941d
> (unknown){code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)