[ 
https://issues.apache.org/jira/browse/MESOS-8839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757399#comment-16757399
 ] 

Benjamin Bannier commented on MESOS-8839:
-----------------------------------------

Reopening as we saw this again in our internal CI with something close to 
today's {{master}} {{HEAD}}.

> Resource provider manager registrar recovery can race with agent on agent 
> state leading to hard failures
> --------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-8839
>                 URL: https://issues.apache.org/jira/browse/MESOS-8839
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, storage
>    Affects Versions: 1.6.0, 1.8.0
>            Reporter: Benjamin Bannier
>            Assignee: Benjamin Bannier
>            Priority: Blocker
>         Attachments: log
>
>
> When running in the agent, the resource provider manager persists its state 
> into the agent's state. The agent uses a LevelDB state which protects against 
> concurrent access. The way we modelled LevelDB, a {{fetch}} while a lock is 
> held leads to a failed {{Future}} result. When the resource provider 
> manager encounters a failed recovery it emits a fatal error, e.g.,
> {noformat}
> 11:48:26 F0425 11:48:26.650568 26819 manager.cpp:254] Failed to recover 
> resource provider manager registry: Failed: IO error: lock 
> /tmp/ParentChildContainerTypeAndContentType_AgentContainerAPITest_RecoverNestedContainer_10_HXbQCK/meta/slaves/6645885c-050a-4518-b896-a20b3e72a070-S0/resource_provider_registry/LOCK:
>  already held by process
> 11:48:26 *** Check failure stack trace: ***{noformat}
> We should not fail hard for such recoverable failure scenarios.
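
One way to avoid failing hard here might be to retry the fetch a bounded number of times before giving up. The following is only a minimal sketch using the C++ standard library, not the actual Mesos or libprocess code: {{FetchResult}} and {{fetchWithRetry}} are hypothetical names, the result type merely stands in for a ready or failed {{Future}}, and treating every failure as retryable is an assumption made for brevity.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <optional>
#include <string>
#include <thread>

// Hypothetical stand-in for a ready or failed `Future` result.
struct FetchResult {
  std::optional<std::string> value;  // Set when the fetch succeeded.
  std::string error;                 // Set when the fetch failed.

  bool ok() const { return value.has_value(); }
};

// Calls `fetch` until it succeeds or `maxAttempts` is reached, doubling
// the delay between attempts. Returns the last result, so the caller can
// still decide how to handle a failure that persists after all retries.
FetchResult fetchWithRetry(
    const std::function<FetchResult()>& fetch,
    int maxAttempts,
    std::chrono::milliseconds delay)
{
  assert(maxAttempts >= 1);

  FetchResult result;
  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    result = fetch();
    if (result.ok()) {
      return result;
    }

    // Assumption: every failure is treated as transient (e.g., the
    // LevelDB lock still being held), so we back off and try again.
    if (attempt < maxAttempts) {
      std::this_thread::sleep_for(delay);
      delay *= 2;
    }
  }

  return result;
}
```

A caller would pass the actual fetch as the callback and log or propagate the returned error if all attempts fail, instead of aborting the process on the first failure.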



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
