[ https://issues.apache.org/jira/browse/MESOS-10014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957497#comment-16957497 ]
Meng Zhu commented on MESOS-10014:
----------------------------------

Hmm, the following log message looks problematic:

{noformat}
I1018 09:05:14.228754 21394 hierarchical.cpp:955] Added agent e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 (ip-172-16-10-17.ec2.internal) with cpus:2; mem:1024; disk:1024; ports:[31000-32000] (offered or allocated: {})
I1018 09:05:14.229159 21394 hierarchical.cpp:1100] Grew agent e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 by disk[RAW(,,profile)]:200 (total), { } (used)
I1018 09:05:14.229632 21394 hierarchical.cpp:1057] Agent e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 (ip-172-16-10-17.ec2.internal) updated with total resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
I1018 09:05:14.230063 21394 hierarchical.cpp:1843] Performed allocation for 1 agents in 128843ns
I1018 09:05:14.230569 21391 master.cpp:10926] Recovered orphan operation 71647a26-b5fe-4b97-9162-0abb2785b909 (ID: operation) on agent e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 belonging to framework e6284079-cb6a-4a47-8f9a-ea9b84ff622a-0000 in state OPERATION_PENDING
I1018 09:05:14.230813 21391 master.cpp:10824] Adding framework e6284079-cb6a-4a47-8f9a-ea9b84ff622a-0000 (default) with roles { } suppressed
I1018 09:05:14.230991 21391 master.cpp:8295] Updating framework e6284079-cb6a-4a47-8f9a-ea9b84ff622a-0000 (default) with roles { } suppressed
I1018 09:05:14.231298 21390 hierarchical.cpp:1100] Grew agent e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 by disk[RAW(,,profile)]:200 (total), { e6284079-cb6a-4a47-8f9a-ea9b84ff622a-0000: disk(allocated: default-role)[RAW(,,profile)]:200 } (used)
{noformat}

This happens after the master failover. In particular, there are two `Grew agent ...` messages, indicating that two resource providers (each with 200 disk) were added. The latter one contains 200 *used* disk.
This is probably the same 200 disk resource printed out above by [~bmahler].

I suspect this relates to orphan operations.

cc [~greggomann]

> `tryUntrackFrameworkUnderRole` check failed in `HierarchicalAllocatorProcess::removeFramework`.
> -----------------------------------------------------------------------------------------------
>
> Key: MESOS-10014
> URL: https://issues.apache.org/jira/browse/MESOS-10014
> Project: Mesos
> Issue Type: Bug
> Components: master, test
> Affects Versions: 1.10
> Reporter: Andrei Budnik
> Priority: Major
> Labels: flaky-test, resource-management
> Attachments: AgentPendingOperationAfterMasterFailover-badrun.txt
>
>
> `ContentType/OperationReconciliationTest.AgentPendingOperationAfterMasterFailover/0` test failed:
> {code:java}
> F1018 09:05:14.310616 21391 hierarchical.cpp:745] Check failed: tryUntrackFrameworkUnderRole(framework, role) Framework: e6284079-cb6a-4a47-8f9a-ea9b84ff622a-0000 role: default-role
> *** Check failure stack trace: ***
> @ 0x7f40fff0a1f6 google::LogMessage::Fail()
> @ 0x7f40fff0a14f google::LogMessage::SendToLog()
> @ 0x7f40fff09a91 google::LogMessage::Flush()
> @ 0x7f40fff0d12f google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f410fd828ac mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeFramework()
> @ 0x186b29f _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES8_EEvRKNS_3PIDIT_EEMSA_FvT0_EOT1_ENKUlOS6_PNS_11ProcessBaseEE_clESJ_SL_
> @ 0x189c273 _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS3_11FrameworkIDESA_EEvRKNS1_3PIDIT_EEMSC_FvT0_EOT1_EUlOS8_PNS1_11ProcessBaseEE_JS8_SN_EEEDTclcl7forwardISC_Efp_Espcl7forwardIT0_Efp0_EEEOSC_DpOSP_
> @ 0x18990b7 _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_11FrameworkIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_JS9_St12_PlaceholderILi1EEEE13invoke_expandISP_St5tupleIJS9_SR_EESU_IJOSO_EEJLm0ELm1EEEEDTcl6invokecl7forwardISD_Efp_Espcl6expandcl3getIXT2_EEcl7forwardISH_Efp0_EEcl7forwardISK_Efp2_EEEEOSD_OSH_N5cpp1416integer_sequenceImJXspT2_EEEESL_
> @ 0x1896100 _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_11FrameworkIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_IS9_St12_PlaceholderILi1EEEEclIISO_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImILm0ELm1EEEE_Ecl16forward_as_tuplespcl7forwardIT_Efp_EEEEDpOSX_
> @ 0x1895174 _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS6_11FrameworkIDESD_EEvRKNS4_3PIDIT_EEMSF_FvT0_EOT1_EUlOSB_PNS4_11ProcessBaseEE_ISB_St12_PlaceholderILi1EEEEEISQ_EEEDTclcl7forwardISF_Efp_Espcl7forwardIT0_Efp0_EEEOSF_DpOSV_
> @ 0x1894b2b _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS7_11FrameworkIDESE_EEvRKNS5_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS5_11ProcessBaseEE_JSC_St12_PlaceholderILi1EEEEEJSR_EEEvOSG_DpOT0_
> @ 0x18943bc _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNSA_11FrameworkIDESH_EEvRKNS1_3PIDIT_EEMSJ_FvT0_EOT1_EUlOSF_S3_E_ISF_St12_PlaceholderILi1EEEEEEclEOS3_
> @ 0x7f41016deb22 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
> @ 0x7f410169620c process::ProcessBase::consume()
> @ 0x7f41016c0696 _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
> @ 0x1822baa process::ProcessBase::serve()
> @ 0x7f4101692af1 process::ProcessManager::resume()
> @ 0x7f410168ed68 _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
> @ 0x7f41016b81e2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x7f41016b7244 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEclEv
> @ 0x7f41016b6088 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> @ 0x7f40fca44590 execute_native_thread_routine
> @ 0x7f40ffa77e25 start_thread
> @ 0x7f40fa396bad __clone
> @ (nil) (unknown)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)