[ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596258#comment-16596258
 ] 

Alexander Rukletsov edited comment on MESOS-8545 at 8/29/18 3:01 PM:
---------------------------------------------------------------------

When the agent handles an {{ATTACH_CONTAINER_INPUT}} call, it creates an HTTP 
[streaming 
connection|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3104]
 to IOSwitchboard.
 After the agent 
[sends|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3141]
 a request to IOSwitchboard, a new instance of {{ConnectionProcess}} is 
created, which calls 
[{{ConnectionProcess::read()}}|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1220]
 to read an HTTP response from IOSwitchboard.
 If the socket is closed before a response terminated by `\r\n\r\n` is received, the 
{{ConnectionProcess}} calls 
[{{disconnect()}}|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1326],
 which in turn [flushes the 
{{pipeline}}|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1197-L1201]
 containing a {{Response}} promise. As a result, the 
{{AttachInputToNestedContainerSession}} 
[test|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/tests/api_tests.cpp#L7942-L7943]
 receives an {{HTTP 500}} error with the body "Disconnected".
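
To make the failure path above easier to follow, here is a minimal, self-contained sketch of it. This is not the libprocess code: {{Outcome}}, {{ConnectionModel}} and {{toHttp()}} are hypothetical names invented for illustration, and a plain callback stands in for the {{Response}} promise. It only models the idea that a disconnect fails every request still waiting in the pipeline, and that the failure surfaces as a 500 with the failure message as the body.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>

// Hypothetical outcome of a request: a response body or a failure message.
struct Outcome {
  bool failed;
  std::string text;
};

// Simplified model of a client connection (an assumption, not libprocess).
// Each in-flight request registers a callback in place of a promise.
class ConnectionModel {
public:
  using Callback = std::function<void(const Outcome&)>;

  // Records an in-flight request that is waiting for its response.
  void send(Callback callback) { pipeline.push(std::move(callback)); }

  // A complete response arrived for the oldest in-flight request.
  void received(const std::string& body) {
    assert(!pipeline.empty());
    Callback callback = std::move(pipeline.front());
    pipeline.pop();
    callback(Outcome{false, body});
  }

  // The socket was closed: fail every request that is still waiting.
  void disconnect() {
    while (!pipeline.empty()) {
      Callback callback = std::move(pipeline.front());
      pipeline.pop();
      callback(Outcome{true, "Disconnected"});
    }
  }

private:
  std::queue<Callback> pipeline;
};

// Models the API layer mapping an outcome to an HTTP status code and body.
std::pair<int, std::string> toHttp(const Outcome& outcome) {
  if (outcome.failed) {
    return {500, outcome.text};
  }
  return {200, outcome.text};
}
```

In this model, calling {{send()}} and then {{disconnect()}} without an intervening {{received()}} hands the callback a failed outcome, which {{toHttp()}} maps to 500 and "Disconnected", matching what the test observes.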

When I/O redirection finishes, {{IOSwitchboardServerProcess}} calls 
{{terminate(self(), false)}} (here 
[\[1\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1262]
 or there 
[\[2\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1713]).
 Then, {{IOSwitchboardServerProcess::finalize()}} sets a value on the 
[{{promise}}|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1304-L1308],
 which [unblocks the 
{{main()}}|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard_main.cpp#L149-L150]
 function. As a result, the IOSwitchboard process terminates immediately.

When IOSwitchboard terminates, some response messages might not yet have been 
[written|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1699]
 to the socket. So, if any delay occurs before 
[sending|https://github.com/apache/mesos/blob/95bbe784da51b3a7eaeb9127e2541ea0b2af07b5/3rdparty/libprocess/src/http.cpp#L1742-L1748]
 the response back to the agent, the socket is closed because the IOSwitchboard 
process has terminated. This causes the premature socket close in the agent 
described above.
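
The ordering problem can be reproduced in isolation with a small POSIX sketch. It is an illustration under stated assumptions, not Mesos code: {{readUntilHeadersEnd()}} and {{clientSeesResponse()}} are hypothetical helpers, a Unix domain {{socketpair()}} stands in for the agent to IOSwitchboard connection, and closing the server descriptor stands in for process termination. The point is only that a response which is queued but not yet written is lost when the descriptor is closed, so the reader hits EOF before it sees `\r\n\r\n`.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#include <cassert>
#include <string>

// Reads from `fd` until "\r\n\r\n" is seen or the peer closes the socket.
// Returns true only if the terminator arrived before EOF or an error.
bool readUntilHeadersEnd(int fd) {
  std::string data;
  char buffer[256];
  while (data.find("\r\n\r\n") == std::string::npos) {
    const ssize_t n = ::read(fd, buffer, sizeof(buffer));
    if (n <= 0) {
      return false;  // EOF or error before a complete response.
    }
    data.append(buffer, static_cast<size_t>(n));
  }
  return true;
}

// Simulates a server that has `response` queued for sending. When
// `flushBeforeExit` is false, the server "terminates" (closes its end of
// the socket) before the queued response is written.
bool clientSeesResponse(const std::string& response, bool flushBeforeExit) {
  int fds[2];
  if (::socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
    return false;
  }
  const int server = fds[0];
  const int client = fds[1];

  if (flushBeforeExit) {
    const ssize_t written = ::write(server, response.data(), response.size());
    assert(written == static_cast<ssize_t>(response.size()));
    (void)written;  // Avoids an unused-variable warning under NDEBUG.
  }

  // Process termination closes all of the process's descriptors.
  ::close(server);

  const bool complete = readUntilHeadersEnd(client);
  ::close(client);
  return complete;
}
```

With a short response such as "HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n", {{clientSeesResponse(response, true)}} reports a complete response, while {{clientSeesResponse(response, false)}} reports a premature close. The actual bug is timing dependent, which is why the test is flaky rather than failing consistently.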

See my previous comment, which includes steps to reproduce the bug.


was (Author: abudnik):
When the agent handles `ATTACH_CONTAINER_INPUT` call, it creates an HTTP 
[streaming 
connection|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3104]
 to IOSwitchboard.
 After the agent 
[sends|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3141]
 a request to IOSwitchboard, a new instance of `ConnectionProcess` is created, 
which calls 
[`ConnectionProcess::read()`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1220]
 to read an HTTP response from IOSwitchboard.
 If the socket is closed before a `\r\n\r\n` response is received, the 
`ConnectionProcess` calls 
`[disconnect()|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1326]`,
 which in turn [flushes 
`pipeline`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1197-L1201]
 containing a `Response` promise. This leads to responding back (to the 
`AttachInputToNestedContainerSession` 
[test|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/tests/api_tests.cpp#L7942-L7943])
 an `HTTP 500` error with body "Disconnected".

When io redirect finishes, IOSwitchboardServerProcess calls `terminate(self(), 
false)` (here 
[\[1\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1262]
 or there 
[\[2\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1713]).
 Then, `IOSwitchboardServerProcess::finalize()` sets a value to the 
[`promise`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1304-L1308],
 which [unblocks 
`main()`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard_main.cpp#L149-L150]
 function. As a result, IOSwitchboard process terminates immediately.

When IOSwitchboard terminates, there could be not yet 
[written|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1699]
 response messages to the socket. So, if any delay occurs before 
[sending|https://github.com/apache/mesos/blob/95bbe784da51b3a7eaeb9127e2541ea0b2af07b5/3rdparty/libprocess/src/http.cpp#L1742-L1748]
 the response back to the agent, the socket will be closed due to IOSwitchboard 
process termination. That leads to the aforementioned premature socket close in 
the agent.

See my previous comment which includes steps to reproduce the bug.

> AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
> -------------------------------------------------------------------
>
>                 Key: MESOS-8545
>                 URL: https://issues.apache.org/jira/browse/MESOS-8545
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.5.0, 1.6.1, 1.7.0
>            Reporter: Andrei Budnik
>            Assignee: Andrei Budnik
>            Priority: Major
>              Labels: Mesosphere, flaky-test
>         Attachments: 
> AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun.txt, 
> AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun2.txt
>
>
> {code:java}
> I0205 17:11:01.091872 4898 http_proxy.cpp:132] Returning '500 Internal Server Error' for '/slave(974)/api/v1' (Disconnected)
> /home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/src/tests/api_tests.cpp:6596: Failure
> Value of: (response).get().status
> Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
