[
https://issues.apache.org/jira/browse/MESOS-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359619#comment-15359619
]
Yan Xu commented on MESOS-5763:
-------------------------------
The issue appears to be that the containerizer has never set the container
state to FETCHING, not since the state was first introduced in:
https://github.com/apache/mesos/commit/25489e53e9f308c5fca3d0293aeceb716b53149d
If the container state is never set to FETCHING, the fetcher is [not
killed|https://github.com/apache/mesos/blob/53de5578c6ffc418275ff801838befc7b3900504/src/slave/containerizer/mesos/containerizer.cpp#L1628]
as part of the container destroy.
As a result, from the agent's point of view the executor hasn't been launched
yet when the registration timeout fires, so the agent never gets to call
[containerizer->wait(containerId)|https://github.com/apache/mesos/blob/53de5578c6ffc418275ff801838befc7b3900504/src/slave/slave.cpp#L4026]
in {{Slave::executorLaunched}} to wait on the executor termination future. The
launch is still blocked by the fetcher.
When I set the container state to FETCHING appropriately the issue is fixed.
Will submit a patch.
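For illustration, the failure mode described above can be modelled with a rough sketch. This is a heavily simplified, hypothetical model and not the actual Mesos code: the {{Container}} struct, the reduced {{State}} enum, and the {{fetch}}, {{destroy}} and {{fetcherLeakedAfterDestroy}} helpers are all assumptions made up for this example. It only shows why a destroy that kills the fetcher solely in the FETCHING state leaves the fetcher running when that state is never set.

```cpp
#include <cassert>

// Hypothetical, simplified model of the container lifecycle.
// NOT the actual Mesos code; names and states are illustrative only.
enum class State { PREPARING, FETCHING, DESTROYING };

struct Container {
  State state = State::PREPARING;
  bool fetcherRunning = false;  // True while the (possibly hung) fetcher is alive.
};

// Models the fetch step. `setState == false` mimics the reported bug
// (state never becomes FETCHING); `true` mimics the proposed fix.
void fetch(Container& c, bool setState) {
  if (setState) {
    c.state = State::FETCHING;
  }
  c.fetcherRunning = true;  // Assume the fetcher hangs, e.g. on HDFS.
}

// Models destroy: the fetcher is killed only if the container is
// known to be in the FETCHING state.
void destroy(Container& c) {
  assert(c.state != State::DESTROYING);  // Destroy is modelled as one-shot.
  if (c.state == State::FETCHING) {
    c.fetcherRunning = false;  // Kill the fetcher.
  }
  c.state = State::DESTROYING;
}

// Returns true if the fetcher is still running after destroy,
// i.e. the launch would remain blocked.
bool fetcherLeakedAfterDestroy(bool setState) {
  Container c;
  fetch(c, setState);
  destroy(c);
  return c.fetcherRunning;
}
```

In this model, {{fetcherLeakedAfterDestroy(false)}} leaves the fetcher alive after destroy, while {{fetcherLeakedAfterDestroy(true)}} does not.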
I observed this with 0.28 and suspect it affects every version since 0.22, but
I can't confirm. If there are future backports this can be a candidate, but I
am not sure it happens often enough to warrant a backport for this issue
alone.
Thoughts?
/cc [~jieyu] [~tnachen]
> Task stuck in fetching is not cleaned up after
> --executor_registration_timeout.
> -------------------------------------------------------------------------------
>
> Key: MESOS-5763
> URL: https://issues.apache.org/jira/browse/MESOS-5763
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Affects Versions: 0.28.0, 1.0.0, 0.29.0
> Reporter: Yan Xu
> Assignee: Yan Xu
>
> When the fetching process hangs forever due to reasons such as HDFS issues,
> the Mesos containerizer attempts to destroy the container and kill the
> executor after {{--executor_registration_timeout}}. However, this reliably
> fails for us: the executor is killed by the launcher destroy and the
> container is destroyed, but the agent never finds out that the executor has
> terminated, leaving the task in the STAGING state forever.