[jira] [Commented] (MESOS-7748) Slow subscribers of streaming APIs can lead to Mesos OOMing.

Alexander Rukletsov (JIRA) Tue, 15 Aug 2017 04:25:44 -0700

    [ 
https://issues.apache.org/jira/browse/MESOS-7748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127095#comment-16127095
 ]


Alexander Rukletsov commented on MESOS-7748:
--------------------------------------------

The problem described in this ticket is well studied: [TCP/IP Orphaned 
Connections 
Vulnerability|http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-1926], 
[Slow read DoS/DDoS 
attack|https://blog.qualys.com/securitylabs/2012/01/05/slow-read], [TCP receive 
window closed indefinitely|https://www.kb.cert.org/vuls/id/723308].

There are several things to consider regarding this attack:
* Does the attacker read slowly or stop reading altogether at some point, 
e.g., when its TCP buffer overflows?
* Are there multiple attackers from different IP addresses?
* What is the "cost", i.e., memory, CPU, of a stalled connection?

The general recommendation of the IETF [TCP Maintenance and Minor 
Extensions|http://www.ietf.org/dyn/wg/charter/tcpm-charter.html] working group 
is to [selectively abort TCP connections that appear to be malicious under 
resource exhaustion conditions|https://www.kb.cert.org/vuls/id/723308]. 
Detecting misbehaving HTTP connections is not a trivial task; any solution is a 
trade-off between improved resiliency and decreased QoS. 

Here are the most popular practical mitigation strategies (in order of 
increasing complexity):
* Absolute connection timeout, e.g., [Go HTTP 
library|https://golang.org/pkg/net/#Conn], see [#\[1\]] for more details.
* Idle connection timeout, write timeout, e.g., [Lighttpd| 
https://redmine.lighttpd.net/projects/1/wiki/Server_max-write-idleDetails]. 
[Some 
sources|https://www.academia.edu/9346526/Analysis_of_Slow_Read_DoS_Attack_and_Countermeasures]
 suggest at least 10 seconds in order to maintain reasonable QoS. 
* Max clients per IP address, e.g., [ModSecurity in Apache| 
https://github.com/SpiderLabs/ModSecurity/wiki/Reference-Manual#secconnwritest
* Data transfer rate, e.g., [Barracuda Load Balancers| 
https://campus.barracuda.com/product/campus/article/display/LBADCv50/17106014/]
* Incremental (adaptive) response timeout, e.g., [Barracuda Load Balancers| 
https://campus.barracuda.com/product/campus/article/display/LBADCv50/17106014/]
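
To make one of the strategies above concrete, here is a minimal sketch of the "max clients per IP address" idea in Go. It only does the bookkeeping; the names {{ipLimiter}}, {{acquire}} and {{release}} are hypothetical and do not come from Mesos, ModSecurity or any library. A server would call {{acquire}} when accepting a connection and {{release}} when closing it.

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

// ipLimiter caps the number of simultaneously open connections per client IP.
// Hypothetical helper for illustration only.
type ipLimiter struct {
	mu     sync.Mutex
	max    int
	active map[string]int
}

func newIPLimiter(max int) *ipLimiter {
	return &ipLimiter{max: max, active: make(map[string]int)}
}

// hostOf strips the port from a "host:port" remote address.
func hostOf(remoteAddr string) string {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr // no port present; use the address as is
	}
	return host
}

// acquire reports whether a new connection from remoteAddr may be accepted
// and, if so, counts it against the client's IP.
func (l *ipLimiter) acquire(remoteAddr string) bool {
	host := hostOf(remoteAddr)
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.active[host] >= l.max {
		return false
	}
	l.active[host]++
	return true
}

// release must be called once per successful acquire when the connection closes.
func (l *ipLimiter) release(remoteAddr string) {
	host := hostOf(remoteAddr)
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.active[host] > 0 {
		l.active[host]--
	}
	if l.active[host] == 0 {
		delete(l.active, host)
	}
}

func main() {
	l := newIPLimiter(2)
	fmt.Println(l.acquire("10.0.0.1:5001")) // true
	fmt.Println(l.acquire("10.0.0.1:5002")) // true
	fmt.Println(l.acquire("10.0.0.1:5003")) // false: limit of 2 reached
	l.release("10.0.0.1:5001")
	fmt.Println(l.acquire("10.0.0.1:5003")) // true again after a release
}
```

Note that a per-IP cap does not help against multiple attackers from different IP addresses, which is why it is usually combined with one of the timeout-based strategies.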

{anchor:1} \[1\] I've experimented a little with the Go HTTP library, see the 
test binary [here|https://github.com/rukletsov/http-stream-test]. The low-level 
[connection class|https://golang.org/pkg/net/#Conn] performs [blocking 
writes|https://golang.org/src/net/net.go?s=6546:6589#L179]. Connection 
timeouts, called [deadlines|https://golang.org/pkg/net/#Conn], apply to the 
connection as a whole, not to a single read / write operation. Idle timeouts 
can be implemented by regularly extending deadlines.

A high-level [HTTP server class|https://golang.org/pkg/net/http/#Server] 
defines read and write timeouts, which are translated into deadlines. However, 
deadlines are refreshed only when a new request comes in, meaning an indefinite 
(or sufficiently long) streamed write is interrupted once the timeout expires. 
The suggested solution seems to be to hijack the connection and implement the 
writing and buffering logic at the application level.

> Slow subscribers of streaming APIs can lead to Mesos OOMing.
> ------------------------------------------------------------
>
>                 Key: MESOS-7748
>                 URL: https://issues.apache.org/jira/browse/MESOS-7748
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Alexander Rukletsov
>            Assignee: Alexander Rukletsov
>            Priority: Critical
>              Labels: mesosphere, reliability
>
> For each active subscriber, Mesos master / slave maintains an event queue, 
> which grows over time if the subscriber does not read fast enough. As the 
> number of such "slow" subscribers grows, so does Mesos master / slave memory 
> consumption, which might lead to an OOM event.
> Ideas to consider:
> * Restrict the number of subscribers for the streaming APIs
> * Check (ping) for inactive or "slow" subscribers
> * Disconnect the subscriber when there are too many queued events in memory



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
