[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056411#comment-15056411
 ] 

Ariel Weisberg edited comment on CASSANDRA-9318 at 12/14/15 6:20 PM:
---------------------------------------------------------------------

Quick note. 65k mutations pending in the mutation stage. 7 memtables pending 
flush. [I hooked memtables pending flush into the backpressure 
mechanism.|https://github.com/apache/cassandra/commit/494eabf48ab48f1e86c058c0b583166ab39dcc39]
That absolutely wrecked performance, as throughput periodically dropped to zero, 
but throughput is still infinitely higher than when the database has OOMed.
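
For reference, a rough sketch of what hooking memtables pending flush into a 
backpressure signal could look like. The class, method names, and threshold below 
are illustrative only, not the code in the linked commit:

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Illustrative sketch only: tracks memtables waiting to be flushed and folds
 * that count into a single "apply backpressure?" decision. Names and the
 * threshold are hypothetical, not the actual patch in the linked commit.
 */
public class MemtablePressureTracker
{
    private final AtomicInteger memtablesPendingFlush = new AtomicInteger();
    private final int maxPendingFlush;

    public MemtablePressureTracker(int maxPendingFlush)
    {
        this.maxPendingFlush = maxPendingFlush;
    }

    // Called when a memtable is switched out and queued for flushing.
    public void onMemtableQueuedForFlush()
    {
        memtablesPendingFlush.incrementAndGet();
    }

    // Called when a flush completes and the memtable's memory is released.
    public void onMemtableFlushed()
    {
        memtablesPendingFlush.decrementAndGet();
    }

    // True when enough memtables are backed up that new writes should be stalled.
    public boolean shouldApplyBackpressure()
    {
        return memtablesPendingFlush.get() >= maxPendingFlush;
    }
}
{code}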

Kicked off a few performance runs to demonstrate what happens when you do have 
backpressure and you try various large limits on in-flight memtables/requests.

[9318 w/backpressure 64m 8g heap memtables 
count|http://cstar.datastax.com/tests/id/fa769eec-a283-11e5-bbc9-0256e416528f]
[9318 w/backpressure 1g 8g heap memtables 
count|http://cstar.datastax.com/tests/id/4c52dd6e-a286-11e5-bbc9-0256e416528f]
[9318 w/backpressure 2g 8g heap memtables 
count|http://cstar.datastax.com/tests/id/b3d5b470-a286-11e5-bbc9-0256e416528f]

I am setting the point where backpressure turns off to almost the same limit as 
the point where it turns on. This smooths out performance just enough for stress 
not to constantly emit huge numbers of errors as writes time out while the 
database stops serving requests for a long time waiting for a memtable to flush.
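
A minimal sketch of that hysteresis, with the turn-off (low) watermark set just 
below the turn-on (high) watermark. The names and the 95% ratio are made up for 
illustration, not the values used in the cstar runs:

{code:java}
/**
 * Hypothetical sketch of watermark hysteresis: backpressure engages at a high
 * watermark and releases at a low watermark set just below it, so the node
 * resumes serving requests almost as soon as memory is reclaimed.
 */
public class BackpressureWatermarks
{
    private final long highWatermarkBytes;
    private final long lowWatermarkBytes;
    private boolean backpressureActive = false;

    public BackpressureWatermarks(long highWatermarkBytes)
    {
        this.highWatermarkBytes = highWatermarkBytes;
        // Turn-off point deliberately close to the turn-on point (e.g. 95%).
        this.lowWatermarkBytes = (long) (highWatermarkBytes * 0.95);
    }

    // Called with the current estimate of bytes held by in-flight requests
    // plus memtables waiting to flush; returns whether reads should be paused.
    public synchronized boolean update(long inFlightBytes)
    {
        if (!backpressureActive && inFlightBytes >= highWatermarkBytes)
            backpressureActive = true;
        else if (backpressureActive && inFlightBytes <= lowWatermarkBytes)
            backpressureActive = false;
        return backpressureActive;
    }
}
{code}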

With pressure from memtables somewhat accounted for, the remaining source of 
pressure that can bring down a node is remotely delivered mutations. I can 
throw those into the calculation and add a listener that blocks reads from 
other cluster nodes. It's a nasty thing to do, but maybe not that different 
from OOM.
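
A hypothetical sketch of folding remote mutation bytes into the same accounting, 
with a listener that can pause reads from internode connections while the node is 
over the limit. The listener interface and names are assumptions for illustration, 
not existing Cassandra APIs:

{code:java}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class RemoteMutationPressure
{
    public interface Listener
    {
        void pauseInternodeReads();
        void resumeInternodeReads();
    }

    private final long limitBytes;
    private final List<Listener> listeners = new CopyOnWriteArrayList<>();
    private long remoteMutationBytes = 0;
    private boolean paused = false;

    public RemoteMutationPressure(long limitBytes)
    {
        this.limitBytes = limitBytes;
    }

    public void register(Listener listener)
    {
        listeners.add(listener);
    }

    // Called when a mutation arrives from another node, with its serialized size.
    public synchronized void onMutationReceived(long bytes)
    {
        remoteMutationBytes += bytes;
        if (!paused && remoteMutationBytes >= limitBytes)
        {
            paused = true;
            listeners.forEach(Listener::pauseInternodeReads);
        }
    }

    // Called once the mutation has been applied and its memory released.
    public synchronized void onMutationApplied(long bytes)
    {
        remoteMutationBytes -= bytes;
        if (paused && remoteMutationBytes < limitBytes)
        {
            paused = false;
            listeners.forEach(Listener::resumeInternodeReads);
        }
    }
}
{code}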

I am going to hack together something to force a node to be slow so I can 
demonstrate overwhelming it with remotely delivered mutations first.


> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths, Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.x, 2.2.x
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and, if it reaches a high watermark, disable read on client 
> connections until it drops back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.
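
As a rough illustration of the high/low watermark idea in the description: since 
the native transport is built on Netty, one way to "disable read" on client 
connections is the per-channel autoRead flag. The class below is a hypothetical 
sketch under that assumption, not the actual transport code:

{code:java}
import io.netty.channel.Channel;
import io.netty.channel.group.ChannelGroup;
import io.netty.channel.group.DefaultChannelGroup;
import io.netty.util.concurrent.GlobalEventExecutor;

/**
 * Hypothetical sketch: stop reading new requests from all client sockets once
 * in-flight bytes cross a high watermark, resume below a low watermark.
 */
public class ClientReadThrottle
{
    private final ChannelGroup clientChannels =
            new DefaultChannelGroup(GlobalEventExecutor.INSTANCE);
    private final long highWatermarkBytes;
    private final long lowWatermarkBytes;
    private long inFlightBytes = 0;

    public ClientReadThrottle(long highWatermarkBytes, long lowWatermarkBytes)
    {
        this.highWatermarkBytes = highWatermarkBytes;
        this.lowWatermarkBytes = lowWatermarkBytes;
    }

    public void trackChannel(Channel channel)
    {
        clientChannels.add(channel); // closed channels are removed automatically
    }

    // Called when a request is read off a client connection.
    public synchronized void onRequestReceived(long bytes)
    {
        inFlightBytes += bytes;
        if (inFlightBytes >= highWatermarkBytes)
            setAutoRead(false); // stop reading from all client sockets
    }

    // Called when the corresponding response has been flushed.
    public synchronized void onRequestCompleted(long bytes)
    {
        inFlightBytes -= bytes;
        if (inFlightBytes <= lowWatermarkBytes)
            setAutoRead(true);  // resume reading
    }

    private void setAutoRead(boolean autoRead)
    {
        for (Channel channel : clientChannels)
            channel.config().setAutoRead(autoRead);
    }
}
{code}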



