[ 
https://issues.apache.org/jira/browse/MESOS-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359619#comment-15359619
 ] 

Yan Xu commented on MESOS-5763:
-------------------------------

The issue appears to be that the containerizer has never set the container 
state to FETCHING, not since the state was first introduced in:
https://github.com/apache/mesos/commit/25489e53e9f308c5fca3d0293aeceb716b53149d

If the container state is never set to FETCHING, the fetcher is [not 
killed|https://github.com/apache/mesos/blob/53de5578c6ffc418275ff801838befc7b3900504/src/slave/containerizer/mesos/containerizer.cpp#L1628]
 as part of the container destroy.
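
To illustrate the failure mode described above, here is a minimal, 
self-contained sketch. It is not the actual Mesos code: {{Container}}, 
{{startFetch}}, {{destroy}} and {{fetcherLeaksAfterDestroy}} are hypothetical 
names, and the state list is simplified. It only models the idea that destroy 
kills the fetcher solely when the state is FETCHING.

```cpp
// Hypothetical, simplified model of the containerizer state handling.
// None of these names are taken from the Mesos source.
enum class State { PREPARING, FETCHING, RUNNING, DESTROYING };

struct Container {
  State state = State::PREPARING;
  bool fetcherRunning = false;
};

// Models the launch path starting the fetcher. When `setFetchingState`
// is false, the state is never updated, which mirrors the reported bug.
void startFetch(Container& container, bool setFetchingState) {
  if (setFetchingState) {
    container.state = State::FETCHING;
  }
  container.fetcherRunning = true;
}

// Models destroy: the fetcher is killed only if the state is FETCHING.
void destroy(Container& container) {
  if (container.state == State::FETCHING) {
    container.fetcherRunning = false;
  }
  container.state = State::DESTROYING;
}

// Returns true if the fetcher is still running after destroy.
bool fetcherLeaksAfterDestroy(bool setFetchingState) {
  Container container;
  startFetch(container, setFetchingState);
  destroy(container);
  return container.fetcherRunning;
}
```

In this sketch, {{fetcherLeaksAfterDestroy(false)}} models the current behavior 
(the fetcher survives destroy) and {{fetcherLeaksAfterDestroy(true)}} models the 
behavior once the state is set before fetching.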

As a result, when the timeout fires, the agent still considers the executor 
not yet launched, so it can't call 
[containerizer->wait(containerId)|https://github.com/apache/mesos/blob/53de5578c6ffc418275ff801838befc7b3900504/src/slave/slave.cpp#L4026]
 in {{Slave::executorLaunched}} to wait on the executor termination future. The 
launch is still blocked by the fetcher.

When I set the container state to FETCHING at the appropriate point, the issue 
is fixed. I will submit a patch.

I observed this with 0.28 and suspect it affects every version since 0.22, but 
I can't confirm. If there are future backports, this can be a candidate, but I 
am not sure this happens often enough to warrant a backport for this issue 
alone.

Thoughts?

/cc [~jieyu] [~tnachen]


> Task stuck in fetching is not cleaned up after 
> --executor_registration_timeout.
> -------------------------------------------------------------------------------
>
>                 Key: MESOS-5763
>                 URL: https://issues.apache.org/jira/browse/MESOS-5763
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>    Affects Versions: 0.28.0, 1.0.0, 0.29.0
>            Reporter: Yan Xu
>            Assignee: Yan Xu
>
> When the fetching process hangs forever due to reasons such as HDFS issues, 
> Mesos containerizer would attempt to destroy the container and kill the 
> executor after {{--executor_registration_timeout}}. However, this reliably 
> fails for us: the executor would be killed by the launcher destroy and the 
> container would be destroyed, but the agent would never find out that the 
> executor is terminated, thus leaving the task in the STAGING state forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
