[
https://issues.apache.org/jira/browse/MESOS-7748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127095#comment-16127095
]
Alexander Rukletsov commented on MESOS-7748:
--------------------------------------------
The problem described in this ticket is well studied: [TCP/IP Orphaned
Connections
Vulnerability|http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-1926],
[Slow read DoS/DDoS
attack|https://blog.qualys.com/securitylabs/2012/01/05/slow-read], [TCP receive
window closed indefinitely|https://www.kb.cert.org/vuls/id/723308].
There are several things to consider regarding this attack:
* Does the attacker read slowly or stop reading at all at some point, e.g.,
when its TCP buffer overflows?
* Are there multiple attackers from different IP addresses?
* What is the "cost", i.e., memory, CPU, of a stalled connection?
The general recommendation of the IETF [TCP Maintenance and Minor
Extensions|http://www.ietf.org/dyn/wg/charter/tcpm-charter.html] working group
is to [selectively abort TCP connections that appear to be malicious under
resource exhaustion conditions|https://www.kb.cert.org/vuls/id/723308].
Detecting misbehaving HTTP connections is not a trivial task; any solution is a
trade-off between improved resiliency and decreased QoS.
Here are the most popular practical mitigation strategies (in order of
increasing complexity):
* Absolute connection timeout, e.g., [Go HTTP
library|https://golang.org/pkg/net/#Conn], see [#\[1\]] for more details.
* Idle connection timeout, write timeout, e.g., [Lighttpd|
https://redmine.lighttpd.net/projects/1/wiki/Server_max-write-idleDetails].
[Some
sources|https://www.academia.edu/9346526/Analysis_of_Slow_Read_DoS_Attack_and_Countermeasures]
suggest at least 10 seconds in order to maintain reasonable QoS.
* Max clients per IP address, e.g., [ModSecurity in Apache|
https://github.com/SpiderLabs/ModSecurity/wiki/Reference-Manual#secconnwritest
* Data transfer rate, e.g., [Barracuda Load Balancers|
https://campus.barracuda.com/product/campus/article/display/LBADCv50/17106014/]
* Incremental (adaptive) response timeout, e.g., [Barracuda Load Balancers|
https://campus.barracuda.com/product/campus/article/display/LBADCv50/17106014/]
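As an illustration of the "max clients per IP address" strategy from the list above, here is a minimal sketch in Go. The names ipLimiter, newIPLimiter, acquire and release are hypothetical and not taken from any of the products linked above; the sketch only counts concurrent connections per remote host and leaves out how it would be wired into an accept loop.

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

// ipLimiter (hypothetical) caps the number of concurrent connections
// accepted from a single remote host.
type ipLimiter struct {
	mu     sync.Mutex
	limit  int
	active map[string]int
}

func newIPLimiter(limit int) *ipLimiter {
	return &ipLimiter{limit: limit, active: make(map[string]int)}
}

// hostOf strips the port from "host:port"; input without a port is used as is.
func hostOf(remoteAddr string) string {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr
	}
	return host
}

// acquire reports whether one more connection from remoteAddr is allowed
// and, if so, counts it.
func (l *ipLimiter) acquire(remoteAddr string) bool {
	host := hostOf(remoteAddr)
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.active[host] >= l.limit {
		return false
	}
	l.active[host]++
	return true
}

// release must be called once for every successful acquire,
// when the connection is closed.
func (l *ipLimiter) release(remoteAddr string) {
	host := hostOf(remoteAddr)
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.active[host] > 1 {
		l.active[host]--
	} else {
		delete(l.active, host)
	}
}

func main() {
	l := newIPLimiter(2)
	fmt.Println(l.acquire("10.0.0.1:5001"), l.acquire("10.0.0.1:5002"), l.acquire("10.0.0.1:5003"))
}
```

A limit like this does not help when there are multiple attackers from different IP addresses, which is one of the considerations listed earlier.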
{anchor:1} \[1\] I've played a little bit with the Go HTTP library, see the test
binary [here|https://github.com/rukletsov/http-stream-test]. The low-level
[connection class|https://golang.org/pkg/net/#Conn] performs [blocking
writes|https://golang.org/src/net/net.go?s=6546:6589#L179]. Connection
timeouts, called [deadlines|https://golang.org/pkg/net/#Conn], apply to a
connection as a whole, not to a single write / read operation. Idle timeouts can
be implemented by regularly extending deadlines.
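The idea of building an idle timeout by regularly extending deadlines can be sketched roughly as follows. This is a minimal illustration, not code from the test binary linked above: the wrapper type idleTimeoutConn and the helper stalledWriteTimesOut are hypothetical names, and the 50 ms timeout is chosen only to keep the demo fast. The demo uses net.Pipe, whose writes block until the peer reads, to simulate a subscriber that never reads.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// idleTimeoutConn (hypothetical) turns the absolute deadline offered by
// net.Conn into an idle write timeout.
type idleTimeoutConn struct {
	net.Conn
	idle time.Duration
}

// Write pushes the write deadline forward before every write, so the
// connection fails only if a single write stalls for longer than idle.
func (c *idleTimeoutConn) Write(p []byte) (int, error) {
	if err := c.Conn.SetWriteDeadline(time.Now().Add(c.idle)); err != nil {
		return 0, err
	}
	return c.Conn.Write(p)
}

// stalledWriteTimesOut writes to a peer that never reads and reports
// whether the write failed with a timeout error.
func stalledWriteTimesOut() bool {
	// net.Pipe is unbuffered: a write blocks until the other end reads.
	server, client := net.Pipe()
	defer server.Close()
	defer client.Close()

	conn := &idleTimeoutConn{Conn: server, idle: 50 * time.Millisecond}
	_, err := conn.Write([]byte("event"))

	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

func main() {
	fmt.Println("stalled write timed out:", stalledWriteTimesOut())
}
```

With this approach a subscriber that keeps reading, however slowly, is never cut off as long as each individual write completes within the idle period.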
A high-level [HTTP server class|https://golang.org/pkg/net/http/#Server]
defines write and read timeouts, which are transformed into deadlines. However,
deadlines are refreshed only when a new request comes in, meaning an indefinite
(or long enough) streamed write is interrupted after the timeout. The suggested
solution seems to be to hijack the connection and implement writing and
buffering logic at the application level.
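A rough sketch of what the hijacking approach could look like, assuming an HTTP/1.1 connection (where the response writer implements http.Hijacker). The handler name streamEvents, the helper fetchStream, the three fixed events and the 10 second value are illustrative assumptions only. After hijacking, the handler has to write the response headers itself and can renew the write deadline before every event.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// writeIdleTimeout is an illustrative value, not a recommendation.
const writeIdleTimeout = 10 * time.Second

// streamEvents (hypothetical) hijacks the connection so that the write
// deadline can be renewed per event instead of once per request.
func streamEvents(w http.ResponseWriter, r *http.Request) {
	hj, ok := w.(http.Hijacker)
	if !ok {
		http.Error(w, "hijacking not supported", http.StatusInternalServerError)
		return
	}
	conn, buf, err := hj.Hijack()
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer conn.Close()

	// After Hijack the application owns the wire format, including headers.
	fmt.Fprint(buf, "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nConnection: close\r\n\r\n")
	for i := 0; i < 3; i++ {
		// Renew the deadline before each event: the subscriber is dropped
		// only if one event cannot be flushed within writeIdleTimeout.
		if err := conn.SetWriteDeadline(time.Now().Add(writeIdleTimeout)); err != nil {
			return
		}
		fmt.Fprintf(buf, "event %d\n", i)
		if err := buf.Flush(); err != nil {
			return // stalled or disconnected subscriber
		}
	}
}

// fetchStream starts a throwaway test server and returns the streamed body.
func fetchStream() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(streamEvents))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := fetchStream()
	fmt.Printf("%q %v\n", body, err)
}
```

The cost of this design is that everything the HTTP server normally handles after the headers (framing, buffering, connection reuse) becomes the application's responsibility.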
> Slow subscribers of streaming APIs can lead to Mesos OOMing.
> ------------------------------------------------------------
>
> Key: MESOS-7748
> URL: https://issues.apache.org/jira/browse/MESOS-7748
> Project: Mesos
> Issue Type: Bug
> Reporter: Alexander Rukletsov
> Assignee: Alexander Rukletsov
> Priority: Critical
> Labels: mesosphere, reliability
>
> For each active subscriber, Mesos master / slave maintains an event queue,
> which grows over time if the subscriber does not read fast enough. As the
> number of such "slow" subscribers grows, so does Mesos master / slave memory
> consumption, which might lead to an OOM event.
> Ideas to consider:
> * Restrict the number of subscribers for the streaming APIs
> * Check (ping) for inactive or "slow" subscribers
> * Disconnect the subscriber when there are too many queued events in memory
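The last idea in the description, disconnecting a subscriber once too many events are queued, can be illustrated with a bounded queue. This is a sketch in Go for consistency with the comment above, not Mesos code: the subscriber type, newSubscriber and publish are hypothetical names, and the sketch assumes a single publishing goroutine per subscriber.

```go
package main

import "fmt"

// subscriber (hypothetical) holds a bounded event queue for one client.
type subscriber struct {
	events       chan string
	disconnected bool
}

func newSubscriber(maxQueued int) *subscriber {
	return &subscriber{events: make(chan string, maxQueued)}
}

// publish enqueues an event without blocking. If the queue is already
// full, the subscriber is marked as disconnected and the queue is closed,
// so memory use per subscriber stays bounded by maxQueued events.
// Assumes it is called from a single goroutine.
func (s *subscriber) publish(event string) bool {
	if s.disconnected {
		return false
	}
	select {
	case s.events <- event:
		return true
	default:
		s.disconnected = true
		close(s.events)
		return false
	}
}

func main() {
	s := newSubscriber(2)
	for _, ev := range []string{"a", "b", "c"} {
		fmt.Println(ev, s.publish(ev))
	}
}
```

Events that were queued before the disconnect can still be drained by the writer; only new events are refused.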