[ https://issues.apache.org/jira/browse/MESOS-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16408669#comment-16408669 ]
James Peach commented on MESOS-8716:
------------------------------------
Here's a stack trace that is symptomatic of this problem:
{noformat}
2018-03-21T04:31:49.272492+00:00 mslave1218 kernel: [3969040.584460] Call Trace:
2018-03-21T04:31:49.272494+00:00 mslave1218 kernel: [3969040.587253] [<ffffffff81579159>] schedule+0x39/0x90
2018-03-21T04:31:49.283684+00:00 mslave1218 kernel: [3969040.592551] [<ffffffff810975ad>] __refrigerator+0x4d/0x140
2018-03-21T04:31:49.283689+00:00 mslave1218 kernel: [3969040.598458] [<ffffffff810570ed>] get_signal+0x36d/0x390
2018-03-21T04:31:49.294814+00:00 mslave1218 kernel: [3969040.604103] [<ffffffff81002c30>] do_signal+0x20/0x130
2018-03-21T04:31:49.294820+00:00 mslave1218 kernel: [3969040.609576] [<ffffffff8109743d>] ? freezing_slow_path+0x4d/0x80
2018-03-21T04:31:49.306702+00:00 mslave1218 kernel: [3969040.615939] [<ffffffff8104b739>] ? SyS_wait4+0xa9/0xf0
2018-03-21T04:31:49.306706+00:00 mslave1218 kernel: [3969040.621495] [<ffffffff81049b40>] ? is_current_pgrp_orphaned+0xe0/0xe0
2018-03-21T04:31:49.319554+00:00 mslave1218 kernel: [3969040.628358] [<ffffffff81002d98>] do_notify_resume+0x58/0x70
2018-03-21T04:31:49.319559+00:00 mslave1218 kernel: [3969040.634351] [<ffffffff8157c802>] int_signal+0x12/0x17
{noformat}
> Freezer controller is not returned to thaw if task termination fails
> --------------------------------------------------------------------
>
> Key: MESOS-8716
> URL: https://issues.apache.org/jira/browse/MESOS-8716
> Project: Mesos
> Issue Type: Bug
> Components: agent, containerization
> Affects Versions: 1.3.2
> Reporter: Sargun Dhillon
> Priority: Major
>
> This issue is related to https://issues.apache.org/jira/browse/MESOS-8004. A
> container may fail to terminate for a variety of reasons. One common reason
> in our system is that containers relying on external storage run fsync
> before exiting (fsync on SIGTERM), which can cause the termination to
> time out.
>
> Even though Mesos has sent the requisite kill signals, the task will never
> terminate because the cgroup stays frozen.
>
> The intended behaviour should be that on failure to terminate, if the pids
> isolator is running, pids.max should be set to 0, to prevent further
> processes from being created, the cgroup should be walked and sigkilled, and
> then thawed. Once the processes finish thawing, the kill signal will be
> delivered, and processed, resulting in the container finally finishing,
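The sequence proposed in the description (cap pids.max, SIGKILL every process in the cgroup, then thaw) can be sketched roughly as below. This is only an illustration, not the Mesos implementation: the function name `kill_and_thaw`, its parameters, and the injectable `kill` callable are hypothetical. It assumes cgroup v1 with the freezer and pids hierarchies mounted, and it only handles the container's own cgroup directory, not nested cgroups.

```python
import os
import signal
from pathlib import Path


def kill_and_thaw(freezer_cgroup, pids_cgroup=None, kill=os.kill):
    """Force-kill every process in a frozen cgroup, then thaw it.

    Hypothetical helper: both path arguments point at the container's
    cgroup directory in the respective cgroup v1 hierarchy.
    Returns the list of PIDs found in the cgroup.
    """
    freezer = Path(freezer_cgroup)

    # 1. If the pids controller is in use, forbid new processes so that
    #    nothing can fork between the kill and the thaw.
    if pids_cgroup is not None:
        (Path(pids_cgroup) / "pids.max").write_text("0")

    # 2. Send SIGKILL to every process listed in the cgroup. While the
    #    cgroup is frozen the signal only stays pending.
    pids = [int(p) for p in (freezer / "cgroup.procs").read_text().split()]
    for pid in pids:
        try:
            kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited in the meantime

    # 3. Thaw the cgroup so the pending SIGKILLs are delivered.
    (freezer / "freezer.state").write_text("THAWED")
    return pids
```

The `kill` parameter exists only so the sketch can be exercised against a scratch directory without signalling real processes; in real use the default `os.kill` would be kept and the paths would point somewhere under the mounted freezer and pids hierarchies.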