Benjamin Bannier created MESOS-9223:
---------------------------------------

             Summary: Storage local provider does not sufficiently handle 
container launch failures or errors
                 Key: MESOS-9223
                 URL: https://issues.apache.org/jira/browse/MESOS-9223
             Project: Mesos
          Issue Type: Improvement
          Components: agent, storage
            Reporter: Benjamin Bannier


The storage local resource provider as currently implemented does not handle 
launch failures or task errors of its standalone containers well enough. If, 
e.g., an RP container fails to come up during node start, a warning is 
logged, but an operator still needs to detect the degraded functionality, 
manually check the state of containers with {{GET_CONTAINERS}}, and decide 
whether the agent needs restarting; I suspect they do not always have enough 
context for this decision. It would be better if the provider either enforced 
a restart by failing over the whole agent, or retried the operation 
(optionally up to some maximum number of retries).
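
To illustrate the proposed behavior, here is a minimal sketch of a bounded 
retry that escalates to a failover once retries are exhausted. It is written 
in Python for brevity and is not Mesos code: {{launch_with_retries}}, 
{{FailoverRequired}}, the {{launch}} callable and the default of 3 retries 
are all hypothetical names and values chosen for this example.

```python
from typing import Callable


class FailoverRequired(Exception):
    """Hypothetical signal that the caller should fail over the whole agent."""


def launch_with_retries(launch: Callable[[], bool], max_retries: int = 3) -> int:
    """Call `launch` until it succeeds and return the number of attempts used.

    `launch` is a hypothetical stand-in for starting a standalone container;
    it returns True on success and False on failure. One initial attempt is
    made, followed by at most `max_retries` retries. If every attempt fails,
    FailoverRequired is raised instead of only logging a warning.
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        if launch():
            return attempts
    raise FailoverRequired(
        f"container launch failed after {attempts} attempts"
    )
```

The point of raising instead of logging is that the decision no longer 
depends on an operator noticing the warning: either the launch eventually 
succeeds, or the failure is escalated explicitly.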



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
