Benjamin Bannier created MESOS-9542:
---------------------------------------

             Summary: Hierarchical allocator check failure when an operation on 
a shutdown framework finishes
                 Key: MESOS-9542
                 URL: https://issues.apache.org/jira/browse/MESOS-9542
             Project: Mesos
          Issue Type: Bug
          Components: master
    Affects Versions: 1.7.1, 1.7.0, 1.6.1, 1.6.0, 1.5.2, 1.5.1, 1.5.0, 1.8.0
            Reporter: Benjamin Bannier


When a non-speculated operation such as {{CREATE_DISK}} becomes terminal 
after the originating framework has been torn down, we run into an assertion 
failure in the allocator.
{noformat}
I0129 11:55:35.764394 57857 master.cpp:11373] Updating the state of operation 
'operation' (uuid: 10a782bd-9e60-42da-90d6-c00997a25645) for framework 
a4d0499b-c0d3-4abf-8458-73e595d061ce-0000 (latest state: OPERATION_PENDING, 
status update state: OPERATION_FINISHED)
F0129 11:55:35.764744 57925 hierarchical.cpp:834] Check failed: 
frameworks.contains(frameworkId){noformat}
With non-speculated operations such as {{CREATE_DISK}} it became possible 
for operations to outlive their originating framework. This was not possible with 
speculated operations like {{RESERVE}}, which were always applied immediately by 
the master.

The master does not take this into account, but instead unconditionally calls 
{{Allocator::updateAllocation}}, which asserts that the framework is still known 
to the allocator.
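
To illustrate the failure mode, here is a minimal, self-contained sketch. It is a toy model and not Mesos code: {{ToyAllocator}} and {{onTerminalOperationUpdate}} are hypothetical names, and the failed check is modelled as an exception rather than a process abort. The guard shown is only one possible way to avoid the crash and is not necessarily the right fix, since the resources of the orphaned operation would still need to be accounted for somewhere.

```cpp
#include <set>
#include <stdexcept>
#include <string>

// Toy stand-in for the allocator (hypothetical; not the Mesos class).
class ToyAllocator {
public:
  void addFramework(const std::string& id) { frameworks.insert(id); }
  void removeFramework(const std::string& id) { frameworks.erase(id); }

  bool contains(const std::string& id) const {
    return frameworks.count(id) > 0;
  }

  // Models the precondition behind the failed check in the log above.
  // Throws instead of aborting so the behaviour can be observed.
  void updateAllocation(const std::string& frameworkId) {
    if (!contains(frameworkId)) {
      throw std::logic_error(
          "Check failed: frameworks.contains(frameworkId)");
    }
    ++updates;
  }

  int updates = 0;  // Number of allocation updates that were applied.

private:
  std::set<std::string> frameworks;
};

// Hypothetical master-side handler for a terminal operation status update.
// Returns true if the allocator was updated, false if the update was
// skipped because the framework is no longer known to the allocator.
bool onTerminalOperationUpdate(
    ToyAllocator& allocator, const std::string& frameworkId) {
  if (!allocator.contains(frameworkId)) {
    // The framework was torn down while the operation was still pending.
    return false;
  }

  allocator.updateAllocation(frameworkId);
  return true;
}
```

In this model, calling {{updateAllocation}} directly after {{removeFramework}} throws, which corresponds to the unconditional call described above, while {{onTerminalOperationUpdate}} returns {{false}} and leaves the allocator untouched.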

Reproducer:
 * register a framework with the master.
 * add an agent with a resource provider.
 * let the framework trigger a non-speculated operation like {{CREATE_DISK}}.
 * tear down the framework before a terminal operation status update reaches 
the master; this causes the master to, among other things, remove the framework 
from the allocator.
 * let a terminal, successful operation status update reach the master.
 * 💥 

To solve this we should clean up the lifetimes of operations. Since operations 
can outlive their framework (unlike, e.g., tasks), we probably need a different 
approach here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
