[ https://issues.apache.org/jira/browse/MESOS-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365181#comment-15365181 ]
Yan Xu commented on MESOS-5763:
-------------------------------
Committed everything other than https://reviews.apache.org/r/49726/
Let's add an integration test in a separate JIRA.
{noformat:title=}
commit 13c2020d429d6d000bb37649f2a1be47de5b8f8c
Author: Jiang Yan Xu <[email protected]>
Date: Fri Jul 1 18:12:01 2016 -0700
Made Mesos containerizer error messages more consistent.
We've been using slightly different wordings of the same condition in
multiple places in the Mesos containerizer, but they don't provide
additional information about where the failure is thrown in a long
continuation chain. Since failures don't capture their location in the
code, we'd better distinguish them in a more meaningful way to assist
debugging.
Review: https://reviews.apache.org/r/49653
commit 8907b5d5e1f007c4592a0417a4d9f20d7e1f8efd
Author: Jiang Yan Xu <[email protected]>
Date: Fri Jul 1 18:11:29 2016 -0700
Improved Mesos containerizer invariant checking.
One of the reasons for MESOS-5763 is the lack of invariant
checking. The Mesos containerizer transitions the container state in
particular ways, so when continuation chains could potentially be
interleaved with other actions, we should verify the state transitions.
Review: https://reviews.apache.org/r/49652
commit 48b1bfa6ec5f88ceea327ae3c5345fd4d11442c7
Author: Jiang Yan Xu <[email protected]>
Date: Fri Jul 1 15:25:54 2016 -0700
Improved Mesos containerizer logging and documentation.
Review: https://reviews.apache.org/r/49651
commit dc18dd7a5ec48a184aeb1c5a7c475ecf7691734b
Author: Jiang Yan Xu <[email protected]>
Date: Wed Jul 6 13:48:34 2016 -0700
Fail container launch if it's destroyed during logger->prepare().
Review: https://reviews.apache.org/r/49725
commit 114474c443678997da8f931a41703f1095206421
Author: Jiang Yan Xu <[email protected]>
Date: Fri Jul 1 15:27:37 2016 -0700
Fixed Mesos containerizer to set container FETCHING state.
If the container state is not properly set to FETCHING, Mesos agent
cannot detect the terminated executor when the fetcher times out.
Review: https://reviews.apache.org/r/49650
{noformat}
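Taken together, the commits describe one pattern. Each continuation in the launch chain first checks that the container is still in the state it expects. It then moves the container to the next state, including FETCHING. If the container was destroyed in the meantime (for example during logger->prepare()), the step fails with a message that names it.

The following is a minimal C++ sketch of that pattern under stated assumptions. `ContainerTable`, `transition()`, `reap()` and every state other than FETCHING are hypothetical illustrations. They are not the actual Mesos containerizer API.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical subset of container states; only FETCHING is taken from the
// commits above, the rest is an assumed simplification.
enum class State { PREPARING, ISOLATING, FETCHING, RUNNING, DESTROYING };

inline std::string stateName(State s) {
  switch (s) {
    case State::PREPARING:  return "PREPARING";
    case State::ISOLATING:  return "ISOLATING";
    case State::FETCHING:   return "FETCHING";
    case State::RUNNING:    return "RUNNING";
    case State::DESTROYING: return "DESTROYING";
  }
  return "UNKNOWN";
}

class ContainerTable {
 public:
  void launch(const std::string& id) { states_[id] = State::PREPARING; }

  // Called at the start of each continuation in the launch chain.
  // Verifies that the container is still in `expected` and moves it to
  // `next`. On failure, returns a message naming `step` so the error can be
  // traced back to its place in the chain.
  std::optional<std::string> transition(const std::string& id, State expected,
                                        State next, const std::string& step) {
    // Caller mistake, not a runtime race: launch steps never expect DESTROYING.
    assert(expected != State::DESTROYING);

    auto it = states_.find(id);
    if (it == states_.end()) {
      return "Container destroyed during " + step;
    }
    if (it->second == State::DESTROYING) {
      return "Container is being destroyed during " + step;
    }
    if (it->second != expected) {
      return "Container in unexpected state " + stateName(it->second) +
             " during " + step + " (expected " + stateName(expected) + ")";
    }
    it->second = next;
    return std::nullopt;
  }

  // A destroy (e.g. triggered by a timeout) can interleave with any step.
  void destroy(const std::string& id) {
    auto it = states_.find(id);
    if (it != states_.end()) it->second = State::DESTROYING;
  }

  // Forgets the container once its destroy has completed.
  void reap(const std::string& id) { states_.erase(id); }

  std::optional<State> state(const std::string& id) const {
    auto it = states_.find(id);
    if (it == states_.end()) return std::nullopt;
    return it->second;
  }

 private:
  std::map<std::string, State> states_;
};
```

The `assert` only guards against a caller mistake. A destroy racing with a step is an expected event, so it is reported as an error string. For example, a destroy that lands while the fetch step is pending yields "Container is being destroyed during fetch" rather than a generic failure.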
> Task stuck in fetching is not cleaned up after
> --executor_registration_timeout.
> -------------------------------------------------------------------------------
>
> Key: MESOS-5763
> URL: https://issues.apache.org/jira/browse/MESOS-5763
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Affects Versions: 0.28.0, 1.0.0, 0.29.0
> Reporter: Yan Xu
> Assignee: Yan Xu
> Priority: Blocker
> Fix For: 0.28.3, 1.0.0, 0.27.4
>
>
> When the fetching process hangs forever due to reasons such as HDFS issues,
> the Mesos containerizer attempts to destroy the container and kill the
> executor after {{--executor_registration_timeout}}. However, this reliably
> fails for us: the executor is killed by the launcher destroy and the
> container is destroyed, but the agent never finds out that the executor
> has terminated, leaving the task in the STAGING state forever.
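The expected behaviour from the issue description can be modeled deterministically by passing time in explicitly. An executor that has not registered within the timeout gets its container destroyed, and the agent must then learn that the executor terminated. The sketch below is hypothetical and single-threaded; it is not agent code. `Agent`, `checkTimeouts()`, `registerExecutor()` and the choice of `FAILED` as the terminal task state are assumptions for illustration only.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

using Clock = std::chrono::steady_clock;

// Assumed task states; FAILED as the terminal state is an illustration only.
enum class TaskState { STAGING, RUNNING, FAILED };

struct Executor {
  Clock::time_point launchedAt;
  bool registered = false;
  TaskState task = TaskState::STAGING;
};

class Agent {
 public:
  explicit Agent(std::chrono::seconds registrationTimeout)
      : timeout_(registrationTimeout) {}

  void launch(const std::string& id, Clock::time_point now) {
    executors_[id] = Executor{now};
  }

  // Executor registered in time: its task can start running.
  void registerExecutor(const std::string& id) {
    Executor& e = executors_.at(id);
    if (e.task != TaskState::STAGING) return;  // Too late; already handled.
    e.registered = true;
    e.task = TaskState::RUNNING;
  }

  // Periodic check playing the role of --executor_registration_timeout.
  void checkTimeouts(Clock::time_point now) {
    for (auto& [id, e] : executors_) {
      if (e.registered || e.task != TaskState::STAGING) continue;
      if (now - e.launchedAt >= timeout_) {
        destroyContainer(id);    // e.g. kills a hung fetcher.
        onExecutorTerminated(e); // The agent must observe the termination.
        // The invariant whose violation this issue describes.
        assert(e.task != TaskState::STAGING);
      }
    }
  }

  TaskState taskState(const std::string& id) const {
    return executors_.at(id).task;
  }

 private:
  void destroyContainer(const std::string& /*id*/) {}  // Stub in this sketch.
  void onExecutorTerminated(Executor& e) { e.task = TaskState::FAILED; }

  std::chrono::seconds timeout_;
  std::map<std::string, Executor> executors_;
};
```

In this model the bug corresponds to skipping `onExecutorTerminated()` after the destroy. The container is gone, but the task would stay in STAGING forever.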
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)