[ https://issues.apache.org/jira/browse/MESOS-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16408669#comment-16408669 ]

James Peach commented on MESOS-8716:
------------------------------------

Here's a stack trace that is symptomatic of this problem:
{noformat}
2018-03-21T04:31:49.272492+00:00 mslave1218 kernel: [3969040.584460] Call Trace:
2018-03-21T04:31:49.272494+00:00 mslave1218 kernel: [3969040.587253]  [<ffffffff81579159>] schedule+0x39/0x90
2018-03-21T04:31:49.283684+00:00 mslave1218 kernel: [3969040.592551]  [<ffffffff810975ad>] __refrigerator+0x4d/0x140
2018-03-21T04:31:49.283689+00:00 mslave1218 kernel: [3969040.598458]  [<ffffffff810570ed>] get_signal+0x36d/0x390
2018-03-21T04:31:49.294814+00:00 mslave1218 kernel: [3969040.604103]  [<ffffffff81002c30>] do_signal+0x20/0x130
2018-03-21T04:31:49.294820+00:00 mslave1218 kernel: [3969040.609576]  [<ffffffff8109743d>] ? freezing_slow_path+0x4d/0x80
2018-03-21T04:31:49.306702+00:00 mslave1218 kernel: [3969040.615939]  [<ffffffff8104b739>] ? SyS_wait4+0xa9/0xf0
2018-03-21T04:31:49.306706+00:00 mslave1218 kernel: [3969040.621495]  [<ffffffff81049b40>] ? is_current_pgrp_orphaned+0xe0/0xe0
2018-03-21T04:31:49.319554+00:00 mslave1218 kernel: [3969040.628358]  [<ffffffff81002d98>] do_notify_resume+0x58/0x70
2018-03-21T04:31:49.319559+00:00 mslave1218 kernel: [3969040.634351]  [<ffffffff8157c802>] int_signal+0x12/0x17
{noformat}

> Freezer controller is not returned to thaw if task termination fails
> --------------------------------------------------------------------
>
>                 Key: MESOS-8716
>                 URL: https://issues.apache.org/jira/browse/MESOS-8716
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization
>    Affects Versions: 1.3.2
>            Reporter: Sargun Dhillon
>            Priority: Major
>
> This issue is related to https://issues.apache.org/jira/browse/MESOS-8004. A 
> container may fail to terminate for a variety of reasons. One common reason 
> in our system is that containers relying on external storage run fsync 
> before exiting (fsync on SIGTERM), which can cause the termination to time 
> out. 
>  
> Even though Mesos has sent the requisite kill signals, the task will never 
> terminate because the cgroup stays frozen. 
>  
> The intended behaviour is that, on failure to terminate, pids.max is set to 
> 0 if the pids isolator is running (to prevent further processes from being 
> created), every process in the cgroup is sent SIGKILL, and the cgroup is 
> then thawed. Once the processes finish thawing, the kill signal is 
> delivered and processed, and the container finally finishes.
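
The sequence described above (cap pids.max, SIGKILL every process, then thaw) could be sketched roughly as follows. This is an illustration only, not Mesos code: the helper name kill_and_thaw and its parameters are hypothetical, it assumes cgroup v1 with the standard freezer.state, cgroup.procs and pids.max interface files, and it handles a single cgroup without recursing into nested ones.

```python
import os
import signal
from pathlib import Path


def kill_and_thaw(freezer_dir, pids_dir=None, kill=os.kill):
    """Hypothetical helper: SIGKILL every process in a frozen cgroup, then thaw.

    freezer_dir: path to the container's cgroup under the freezer hierarchy.
    pids_dir:    path to the same cgroup under the pids hierarchy, or None
                 if the pids controller is not in use.
    kill:        injectable for testing; defaults to os.kill.
    Returns the list of PIDs that were signalled.
    """
    freezer_dir = Path(freezer_dir)

    # 1. Forbid new processes so nothing can fork while we clean up.
    if pids_dir is not None:
        (Path(pids_dir) / "pids.max").write_text("0")

    # 2. Queue SIGKILL for every member. Frozen tasks cannot act on the
    #    signal yet; it stays pending until they are thawed.
    signalled = []
    for line in (freezer_dir / "cgroup.procs").read_text().split():
        pid = int(line)
        try:
            kill(pid, signal.SIGKILL)
            signalled.append(pid)
        except ProcessLookupError:
            # The process exited between the read and the kill.
            pass

    # 3. Thaw the cgroup so the pending SIGKILLs are delivered.
    (freezer_dir / "freezer.state").write_text("THAWED")
    return signalled
```

Setting pids.max before signalling is what keeps a thawed process from forking a new child in the window before its SIGKILL is processed.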



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
