[ 
https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619415#comment-16619415
 ] 

Alexander Rukletsov edited comment on MESOS-9131 at 9/18/18 5:57 PM:
---------------------------------------------------------------------

*{{master}} aka {{1.8-dev}}*:
{noformat}
commit 2fdc8f3cffc5eac91e5f2b0c6aef2254acfc2bd0
Author:     Andrei Budnik <abud...@mesosphere.com>
AuthorDate: Tue Sep 18 19:09:31 2018 +0200
Commit:     Alexander Rukletsov <al...@apache.org>
CommitDate: Tue Sep 18 19:09:31 2018 +0200

    Fixed IOSwitchboard waiting EOF from attach container input request.
    
    Previously, when a corresponding nested container terminated, while the
    user was attached to the container's stdin via `ATTACH_CONTAINER_INPUT`
    IOSwitchboard didn't terminate immediately. IOSwitchboard was waiting
    for EOF message from the input HTTP connection. Since the IOSwitchboard
    was stuck, the corresponding nested container was also stuck in
    `DESTROYING` state.
    
    This patch fixes the aforementioned issue by sending 200 `OK` response
    for `ATTACH_CONTAINER_INPUT` call in the case when io redirect is
    finished while reading from the HTTP input connection is not.
    
    Review: https://reviews.apache.org/r/68232/
{noformat}
{noformat}
commit e941d206f651bde861675a6517a89e44d1f61a34
Author:     Andrei Budnik <abud...@mesosphere.com>
AuthorDate: Tue Sep 18 19:10:01 2018 +0200
Commit:     Alexander Rukletsov <al...@apache.org>
CommitDate: Tue Sep 18 19:10:01 2018 +0200

    Added `AgentAPITest.LaunchNestedContainerSessionKillTask` test.
    
    This test verifies that IOSwitchboard, which holds an open HTTP input
    connection, terminates once IO redirects finish for the corresponding
    nested container.
    
    Review: https://reviews.apache.org/r/68230/
{noformat}
{noformat}
commit 7ad390b3aa261f4a39ff7f2c0842f2aae39005f4
Author:     Andrei Budnik <abud...@mesosphere.com>
AuthorDate: Tue Sep 18 19:10:07 2018 +0200
Commit:     Alexander Rukletsov <al...@apache.org>
CommitDate: Tue Sep 18 19:10:07 2018 +0200

    Added `AgentAPITest.AttachContainerInputRepeat` test.
    
    This test verifies that we can call `ATTACH_CONTAINER_INPUT` more
    than once. We send a short message first then we send a long message
    in chunks.
    
    Review: https://reviews.apache.org/r/68231/
{noformat}
*{{1.7.1}}*:
{noformat}
commit e9605a6243db41c1bbc85ec9ade112f2ef806c15
commit f672afef601c71d69a9eb4db3c191bacfe167d3e
commit 4a1b3186a2fa64bf7d94787f3546dd584e2f1186
{noformat}
*{{1.6.2}}*:
{noformat}
commit e3a9eb3b473a10f210913d568c1d9923ed05d933
commit a1798ae1fb2249280f4a4e9fec69eb9e37b95452
commit d82177d00a4a25d70aab172a91c855ad6b07f768
{noformat}
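
The behaviour change described in the first commit can be illustrated with a small sketch. This is not the Mesos C++ implementation: the handler name {{attach_container_input}} and the two helper coroutines are hypothetical, and Python's {{asyncio}} stands in for libprocess. It only shows the idea of replying 200 {{OK}} as soon as either the input stream reaches EOF or the IO redirect finishes, instead of waiting for EOF alone.

```python
import asyncio


async def attach_container_input(read_input, io_redirect_done):
    """Hypothetical handler: respond once input hits EOF OR the IO redirect ends."""
    read_task = asyncio.ensure_future(read_input)
    redirect_task = asyncio.ensure_future(io_redirect_done)

    # Before the fix, the equivalent of this wait covered only `read_task`,
    # so a client that never sent EOF kept the handler alive indefinitely.
    done, pending = await asyncio.wait(
        {read_task, redirect_task}, return_when=asyncio.FIRST_COMPLETED
    )

    # Stop whichever side is still running and let the cancellation settle.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    reason = "io redirect finished" if redirect_task in done else "input EOF"
    return (200, reason)


async def never_eof():
    # Models a client that stays attached to stdin and never sends EOF.
    await asyncio.Event().wait()


async def container_exits():
    # Models the nested container terminating shortly after attach.
    await asyncio.sleep(0.01)


def demo():
    return asyncio.run(attach_container_input(never_eof(), container_exits()))


if __name__ == "__main__":
    print(demo())
```

In this toy model the call returns even though the client never closes its input, which mirrors the case the patch addresses.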



> Health checks launching nested containers while a container is being 
> destroyed lead to unkillable tasks.
> --------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9131
>                 URL: https://issues.apache.org/jira/browse/MESOS-9131
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization
>    Affects Versions: 1.5.1
>            Reporter: Jan Schlicht
>            Assignee: Andrei Budnik
>            Priority: Blocker
>              Labels: container-stuck
>             Fix For: 1.5.2, 1.6.2, 1.7.1, 1.8.0
>
>
> A container might get stuck in {{DESTROYING}} state if there's a command 
> health check that starts new nested containers while its parent container is 
> getting destroyed.
> Here are some logs with unrelated lines removed. The 
> `REMOVE_NESTED_CONTAINER`/`LAUNCH_NESTED_CONTAINER_SESSION` calls keep looping 
> afterwards.
> {noformat}
> 2018-04-16 12:37:54: I0416 12:37:54.235877  3863 containerizer.cpp:2807] 
> Container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 has 
> exited
> 2018-04-16 12:37:54: I0416 12:37:54.235914  3863 containerizer.cpp:2354] 
> Destroying container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 in 
> RUNNING state
> 2018-04-16 12:37:54: I0416 12:37:54.235932  3863 containerizer.cpp:2968] 
> Transitioning the state of container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 
> from RUNNING to DESTROYING
> 2018-04-16 12:37:54: I0416 12:37:54.236100  3852 linux_launcher.cpp:514] 
> Asked to destroy container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.237671  3852 linux_launcher.cpp:560] 
> Using freezer to destroy cgroup 
> mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.240327  3852 cgroups.cpp:3060] Freezing 
> cgroup 
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.244179  3852 cgroups.cpp:1415] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
>  after 3.814144ms
> 2018-04-16 12:37:54: I0416 12:37:54.250550  3853 cgroups.cpp:3078] Thawing 
> cgroup 
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.256599  3853 cgroups.cpp:1444] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
>  after 5.977856ms
> ...
> 2018-04-16 12:37:54: I0416 12:37:54.371117  3837 http.cpp:3502] Processing 
> LAUNCH_NESTED_CONTAINER_SESSION call for container 
> 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd'
> 2018-04-16 12:37:54: W0416 12:37:54.371692  3842 http.cpp:2758] Failed to 
> launch container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd:
>  Parent container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is 
> in 'DESTROYING' state
> 2018-04-16 12:37:54: W0416 12:37:54.371826  3840 containerizer.cpp:2337] 
> Attempted to destroy unknown container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd
> ...
> 2018-04-16 12:37:55: I0416 12:37:55.504456  3856 http.cpp:3078] Processing 
> REMOVE_NESTED_CONTAINER call for container 
> 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-f3a1238c-7f0f-4db3-bda4-c0ea951d46b6'
> ...
> 2018-04-16 12:37:55: I0416 12:37:55.556367  3857 http.cpp:3502] Processing 
> LAUNCH_NESTED_CONTAINER_SESSION call for container 
> 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211'
> ...
> 2018-04-16 12:37:55: W0416 12:37:55.582137  3850 http.cpp:2758] Failed to 
> launch container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211:
>  Parent container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is 
> in 'DESTROYING' state
> ...
> 2018-04-16 12:37:55: W0416 12:37:55.583330  3844 containerizer.cpp:2337] 
> Attempted to destroy unknown container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211
> ...
> {noformat}
> This stops when the framework reconciles and instructs Mesos to kill the 
> task, which also results in the following log line:
> {noformat}
> 2018-04-16 13:06:04: I0416 13:06:04.161623  3843 http.cpp:2966] Processing 
> KILL_NESTED_CONTAINER call for container 
> 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133'
> {noformat}
> Nothing else related to this container is logged following this line.
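
The loop visible in the logs above can be reproduced with a toy model. This is only a sketch under stated assumptions: {{State}}, {{can_launch_nested}} and {{simulate_health_checks}} are hypothetical names, and the rule that launches are refused while the parent is in {{DESTROYING}} state is inferred from the warning text in the logs, not taken from the Mesos source.

```python
from enum import Enum


class State(Enum):
    RUNNING = "RUNNING"
    DESTROYING = "DESTROYING"


def can_launch_nested(parent_state):
    # Rule inferred from the warning in the logs: a launch is refused
    # while the parent container is in DESTROYING state.
    return parent_state is not State.DESTROYING


def simulate_health_checks(parent_state, attempts):
    """Return the (call, outcome) pairs a naive checker would generate."""
    calls = []
    for _ in range(attempts):
        # The checker first cleans up the previous check container...
        calls.append(("REMOVE_NESTED_CONTAINER", "processed"))
        # ...then tries to launch a new one for the next check.
        outcome = "launched" if can_launch_nested(parent_state) else "failed"
        calls.append(("LAUNCH_NESTED_CONTAINER_SESSION", outcome))
    return calls


if __name__ == "__main__":
    for call, outcome in simulate_health_checks(State.DESTROYING, 2):
        print(call, outcome)
```

As long as the parent never leaves {{DESTROYING}}, every iteration ends in a failed launch, which matches the repeating pattern in the log excerpt.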



