[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506868#comment-15506868 ]

Sergio Bossa commented on CASSANDRA-9318:

Yay! Thanks everyone (but [~Stefania] in particular) for the great feedback and reviews :)

> Bound the number of in-flight requests at the coordinator
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths, Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Assignee: Sergio Bossa
>             Fix For: 3.10
>
>         Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, limit.btm, no_backpressure.png
>
> It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding bytes and requests and if it reaches a high watermark disable read on client connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't introduce other issues.
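For illustration, here is a minimal sketch of the high/low watermark idea from the description above: stop reading from client connections once in-flight bytes cross a high watermark and resume once they drop below a low watermark. All class and method names are invented for this sketch; the patch that was eventually committed takes a rate-based approach instead.

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of the watermark idea from the issue description.
final class InFlightLimiter
{
    private final long highWatermarkBytes;
    private final long lowWatermarkBytes;
    private final AtomicLong inFlightBytes = new AtomicLong();
    private volatile boolean paused;

    InFlightLimiter(long highWatermarkBytes, long lowWatermarkBytes)
    {
        this.highWatermarkBytes = highWatermarkBytes;
        this.lowWatermarkBytes = lowWatermarkBytes;
    }

    /** Called when a request is admitted; returns true if client reads should be disabled. */
    boolean onRequestStarted(long requestBytes)
    {
        long current = inFlightBytes.addAndGet(requestBytes);
        if (current >= highWatermarkBytes)
            paused = true;
        return paused;
    }

    /** Called when a request completes; returns true if client reads can be re-enabled. */
    boolean onRequestFinished(long requestBytes)
    {
        long current = inFlightBytes.addAndGet(-requestBytes);
        if (paused && current <= lowWatermarkBytes)
            paused = false;
        return !paused;
    }
}
{code}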
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505212#comment-15505212 ]

Stefania commented on CASSANDRA-9318:

Committed to trunk as d43b9ce5092f8879a1a66afebab74d86e9e127fb, thank you for your excellent patch [~sbtourist]!
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503635#comment-15503635 ]

Sergio Bossa commented on CASSANDRA-9318:

Thanks [~Stefania], looks good!
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502388#comment-15502388 ]

Stefania commented on CASSANDRA-9318:

Latest dtest [build|https://cassci.datastax.com/view/Dev/view/sbtourist/job/sbtourist-CASSANDRA-9318-trunk-dtest/7/] completed without failures.
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502215#comment-15502215 ]

Stefania commented on CASSANDRA-9318:

Thanks [~sbtourist], the latest commits and the entire patch LGTM.

The test failures are all unrelated and we have tickets for all of them: CASSANDRA-12664, CASSANDRA-12656 and CASSANDRA-12140. I've launched one more dtest build to cover the final commit and to hopefully shake off the CASSANDRA-12656 failures since these tests shouldn't even be running now.

I've squashed your entire patch into [one commit|https://github.com/stef1927/cassandra/commit/1632f2e9892624f611ac3629fb84a82594fec726] and fixed some formatting issues (mostly trailing spaces) [here|https://github.com/stef1927/cassandra/commit/e3346e5f5a49b2933e10a84405730] on this [branch|https://github.com/stef1927/cassandra/commits/9318]. If you could double check the formatting nits, I can squash them and commit once the final dtest build has also completed.
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493898#comment-15493898 ]

Sergio Bossa commented on CASSANDRA-9318:

[~Stefania], I fixed the tests related to the new {{DatabaseDescriptor}} initialization methods.

I've also addressed [~slebresne]'s concerns and modified the back-pressure algorithm to always observe the write timeout: if the rate limit would cause the timeout to be exceeded, then rather than observing the rate limit, we just pause up to the timeout _minus_ the current response time from the replica with the lowest rate. This avoids client timeouts and also gives replicas enough time to actually acknowledge the mutations (at the expense of having more in-flight mutations than the rate limit allows, but I believe this is the right tradeoff).

I've run several rounds of tests and dtests: tests are always green, but some dtests fail intermittently; those failures do not seem related to this issue, but someone more familiar with the failing dtests might want to have a look.

Finally, I've re-run some manual stress tests on an overloaded 4-node RF=3 cluster. Here are the results of inserting 1M rows at CL.ONE:

* SLOW back-pressure:
||Node||Dropped Mutations||Dropped Hints||
|1|18143|0|
|2|10|0|
|3|0|0|
|4|0|0|
Timeouts: 39
Total runtime: 20 mins

* No back-pressure:
||Node||Dropped Mutations||Dropped Hints||
|1|471751|248403|
|2|70996|13571|
|3|640|0|
|4|75318|24801|
Timeouts: 6
Total runtime: 5 mins

At CL.QUORUM:

* SLOW back-pressure:
||Node||Dropped Mutations||Dropped Hints||
|1|27781|8584|
|2|4650|0|
|3|0|0|
|4|0|0|
Timeouts: 37
Total runtime: 17 mins

* No back-pressure:
||Node||Dropped Mutations||Dropped Hints||
|1|353972|133429|
|2|258776|81981|
|3|636|0|
|4|13870|1710|
Timeouts: 74
Total runtime: 6 mins
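For illustration, a minimal sketch of the pause computation described above, under the assumption that the coordinator knows how long the rate limiter would make it wait and the current response time of the slowest replica. The names are invented for this sketch and do not reflect the actual patch code.

{code:java}
import java.util.concurrent.TimeUnit;

final class BackPressureSketch
{
    /**
     * Hypothetical version of the pause described above: never sleep past the write
     * timeout, and leave the slowest replica enough time to acknowledge the mutation.
     *
     * @param rateLimiterDelayNanos      how long the per-replica rate limiter would make us wait
     * @param writeTimeoutNanos          the configured write RPC timeout
     * @param slowestReplicaLatencyNanos current response time of the slowest (lowest-rate) replica
     */
    static long boundedPauseNanos(long rateLimiterDelayNanos,
                                  long writeTimeoutNanos,
                                  long slowestReplicaLatencyNanos)
    {
        // Time we can afford to spend paused while still giving the replica a chance to reply.
        long budget = Math.max(0, writeTimeoutNanos - slowestReplicaLatencyNanos);
        // Honour the rate limit only as far as that budget allows.
        return Math.min(rateLimiterDelayNanos, budget);
    }

    public static void main(String[] args)
    {
        long pause = boundedPauseNanos(TimeUnit.MILLISECONDS.toNanos(3000),
                                       TimeUnit.MILLISECONDS.toNanos(2000),
                                       TimeUnit.MILLISECONDS.toNanos(500));
        // Prints 1500: the rate limiter asked for 3000ms, but only 1500ms fit in the timeout budget.
        System.out.println("pause (ms): " + TimeUnit.NANOSECONDS.toMillis(pause));
    }
}
{code}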
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15490107#comment-15490107 ]

Sergio Bossa commented on CASSANDRA-9318:

bq. It's possible I'm totally missing a point you and Stefania are trying to make, but it seems to me to be the only reasonable way. The timeout is a deadline the user asks us to respect, it's the whole point of it, so it should always be respected as strictly as possible.

I didn't follow the two issues mentioned above: if that's the end goal, I agree we should be strict with it.

bq. that discussion seems to suggest that back-pressure would make it harder for C* to respect a reasonable timeout. I'll admit that sounds counter-intuitive to me as a functioning back-pressure should make it easier by smoothing things over when there is too much pressure.

The load is smoothed on the server; beyond that, it depends on how many replicas are "in trouble" and how aggressive clients are. As an example, say the timeout is 2s, the request incoming rate at the coordinator is 1000/s, the processing rate at replica A is 50/s and at replica B is 1000/s. Then with CL.ONE (assuming the coordinator is not part of the replica group for simplicity):

1) If back-pressure is disabled, we get no client timeouts, but ~900 mutations/s dropped on replica A.
2) If back-pressure is enabled, the back-pressure rate limiting at the coordinator is set at 50/s (assuming the SLOW configuration) to smooth the load between servers, which means ~900 mutations/s will end up in client timeouts, and it will be the client's responsibility to back down to a saner ingestion rate; that is, if it keeps ingesting at a higher rate, there's nothing we can do to smooth its side of the equation.

(A rough worked version of this arithmetic is sketched after this comment.)

In case #2, the client timeout can be seen as a signal for the client to slow down, so I'm fine with that (we also hinted in the past at adding more back-pressure related information to the exception, but it seems this requires a change to the native protocol?).

That said, this is a bit of an edge case: most of the time, when there is such a difference in replica responsiveness, it's because of transient short-lived events such as GC or compaction spikes, and the back-pressure algorithm will not catch that, as it's meant to react to continuous overloading. On the other hand, when the node is continuously overloaded by clients, most of the time all replicas will suffer, and back-pressure will smooth out the load; in such a case, the rate of client timeouts shouldn't really change much, but I'll do another test with these new changes.

Hope this helps clarify things a little bit.
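A back-of-the-envelope version of the arithmetic in the example above, using the same illustrative numbers; the figures are approximations, not measurements.

{code:java}
// Rough CL.ONE arithmetic from the example: coordinator ingests 1000 writes/s,
// replica A sustains 50/s, replica B 1000/s. All numbers are illustrative.
final class OverloadArithmetic
{
    public static void main(String[] args)
    {
        double incomingRate = 1000, replicaA = 50, replicaB = 1000;

        // 1) Back-pressure disabled: every write is acked by B (CL.ONE succeeds),
        //    but A silently drops whatever exceeds its capacity.
        double droppedOnA = incomingRate - replicaA;
        System.out.printf("no back-pressure: ~%.0f/s dropped on A, ~0 client timeouts%n", droppedOnA);

        // 2) SLOW back-pressure: the coordinator throttles to the slowest replica's rate,
        //    so the excess shows up as client timeouts instead of dropped mutations.
        double coordinatorRate = Math.min(replicaA, replicaB);
        double timedOut = incomingRate - coordinatorRate;
        System.out.printf("SLOW back-pressure: ~%.0f/s client timeouts, ~0 dropped on A%n", timedOut);
    }
}
{code}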
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489927#comment-15489927 ]

Sylvain Lebresne commented on CASSANDRA-9318:

bq. If back pressure kicks in, then it may be that mutations are applied by all replicas but the application receives a timeout nonetheless.

That's *not* specific to back-pressure: it can perfectly well happen today, and that's fine. Again, if your application *requires* that the server answers within, say, 500ms, the server should do so and not randomly miss that deadline because back-pressure kicks in. If the deadline you set as a user is too low and you get timeouts too often, then it's your problem and you should either reconsider your deadline because it's unrealistic, or increase your cluster capacity. But I genuinely don't understand how not doing what the user explicitly asks us to do would ever make sense.

bq. 1) We keep the timeout strict and change the back-pressure implementation to wait at most the given timeout, and fail otherwise.

It's possible I'm totally missing a point you and Stefania are trying to make, but it seems to me to be the only reasonable way. The timeout is a deadline the user asks us to respect, it's the whole point of it, so it should always be respected as strictly as possible. If the user sets it too low, that's their mistake (and don't get me wrong, we should certainly educate users to avoid that mistake through documentation as much as possible).

More generally though, that discussion seems to suggest that back-pressure would make it harder for C* to respect a reasonable timeout. I'll admit that sounds counter-intuitive to me, as a functioning back-pressure should make it _easier_ by smoothing things over when there is too much pressure.
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489860#comment-15489860 ]

Sergio Bossa commented on CASSANDRA-9318:

[~slebresne], I do understand your concerns. I see [~Stefania] largely answered them already, but it all depends on how strict we want to be with regard to CASSANDRA-12256 and CASSANDRA-2848, so I propose one of the following:

1) We keep the timeout strict and change the back-pressure implementation to wait at most the given timeout, and fail otherwise.
2) We add a back-pressure configuration parameter letting users choose whether they want a strict timeout even in case of back-pressure, or whether they want it "adaptive".

Thoughts?
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489813#comment-15489813 ]

Stefania commented on CASSANDRA-9318:

If back-pressure kicks in, then it may be that mutations are applied by all replicas but the application receives a timeout nonetheless. The timeout may even have already expired after back-pressure is applied but before the mutations are sent to the replicas. So in this case, I think it would make sense to have a timeout long enough to take into account the effects of back-pressure when the system is overloaded.
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489764#comment-15489764 ]

Sylvain Lebresne commented on CASSANDRA-9318:

bq. If we don't do this, then users have to manually increase the timeout when enabling back-pressure, by an amount of time that is not known a priori.

Why? Again, what we want the timeout to mean, in the context of CASSANDRA-12256 and CASSANDRA-2848, is an upper bound on the server answering the application (even if that means giving up). In other words, the timeout is about what the application needs, and that shouldn't be influenced by having back-pressure on or not. Besides, back-pressure should be largely transparent from the user point of view, at least as long as nothing goes too badly (and it should hopefully make things better otherwise, not worse), so it makes no sense from the user point of view to have to bump the timeout just because back-pressure is enabled.
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489726#comment-15489726 ]

Stefania commented on CASSANDRA-9318:

bq. I don't see why having that violated when back-pressure kicks in would make sense. It just seems like a gotcha for users that I'd rather avoid.

If we don't do this, then users have to manually increase the timeout when enabling back-pressure, by an amount of time that is not known a priori. So neither scenario is ideal.
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489709#comment-15489709 ]

Sylvain Lebresne commented on CASSANDRA-9318:

bq. with back-pressure enabled the time spent paused in back-pressure is not counted against

I don't understand that rationale. The goal of CASSANDRA-12256 is to make sure clients can rely on the timeout being an upper bound on when they get an answer from the server. I don't see why having that violated when back-pressure kicks in would make sense. It just seems like a gotcha for users that I'd rather avoid.
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489654#comment-15489654 ]

Stefania commented on CASSANDRA-9318:

Sounds good, thanks.
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489629#comment-15489629 ]

Sergio Bossa commented on CASSANDRA-9318:

bq. queryStartNanoTime was introduced by CASSANDRA-12256

Yep, I was just looking at that.

bq. Should we increase the timeout by the amount of time spent waiting for the backpressure strategy

Yes, this seems the best solution, plus documenting it, i.e. the fact that with back-pressure enabled the time spent paused in back-pressure is not counted against the timeout.

bq. Also, it should not be changed if the backpressure strategy is disabled.

I changed it regardless in the previous solution for consistency, but I agree that post CASSANDRA-12256 it should be changed only in case of back-pressure.
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489612#comment-15489612 ]

Stefania commented on CASSANDRA-9318:

bq. Alright, looks like queryStartNanoTime replaced the old start variable in the latest rebase, let me fix that and check/rerun tests.

I understand now. {{queryStartNanoTime}} was introduced by CASSANDRA-12256; it comes all the way from {{QueryProcessor}}. I think the aim was to take into account the full query processing time from the client point of view, including CQL parsing and so forth. Should we increase the timeout by the amount of time spent waiting for the backpressure strategy, rather than resetting the start time? Also, it should not be changed if the backpressure strategy is disabled.
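A minimal sketch of the adjustment discussed in this exchange, assuming back-pressure is applied before the write is dispatched to replicas; the helper names are invented and do not reflect the actual trunk API.

{code:java}
import java.util.concurrent.TimeUnit;

final class TimeoutAdjustmentSketch
{
    /**
     * Illustrative only: measure how long back-pressure parked the write and stretch the
     * effective timeout by that amount, instead of resetting queryStartNanoTime.
     * applyBackPressure stands in for whatever the strategy does; it is not a real API.
     */
    static long effectiveWriteTimeoutNanos(Runnable applyBackPressure,
                                           long writeTimeoutNanos,
                                           boolean backPressureEnabled)
    {
        long pauseNanos = 0;
        if (backPressureEnabled)
        {
            long before = System.nanoTime();
            applyBackPressure.run();                   // may sleep on a rate limiter
            pauseNanos = System.nanoTime() - before;
        }
        // Only the back-pressure pause is excluded from the client-visible deadline;
        // with back-pressure disabled the timeout is left untouched.
        return writeTimeoutNanos + pauseNanos;
    }

    public static void main(String[] args)
    {
        long timeout = TimeUnit.MILLISECONDS.toNanos(2000);
        long effective = effectiveWriteTimeoutNanos(() -> sleepQuietly(100), timeout, true);
        System.out.println("effective timeout (ms): " + TimeUnit.NANOSECONDS.toMillis(effective));
    }

    private static void sleepQuietly(long millis)
    {
        try { Thread.sleep(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
{code}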
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489578#comment-15489578 ]

Sergio Bossa commented on CASSANDRA-9318:

bq. This method initializes the variable but where is it read?

Alright, looks like {{queryStartNanoTime}} replaced the old {{start}} variable in the latest rebase, let me fix that and check/rerun tests.
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489558#comment-15489558 ]

Stefania commented on CASSANDRA-9318:

bq. The latest rebase brought several changes and conflicts so I was waiting for the jenkins test results to fix any related test failures: let me do that and rerun the suite.

OK

bq. It is used here. Or am I missing something?

This method initializes the variable but where is it read?
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489551#comment-15489551 ]

Sergio Bossa commented on CASSANDRA-9318:

Thanks [~Stefania]. The latest rebase brought several changes and conflicts, so I was waiting for the Jenkins test results to fix any related test failures: let me do that and rerun the suite.

Regarding:

bq. Is this commit complete? It looks like AbstractWriteResponseHandler.start is not used.

It is used [here|https://github.com/apache/cassandra/commit/ad729f2d1758ec8c4add7b71ce3c3a680f5beb6d#diff-71f06c193f5b5e270cf8ac695164f43aR1307]. Or am I missing something?
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489119#comment-15489119 ]

Stefania commented on CASSANDRA-9318:

Thanks for the update, and for running the tests. Is it possible to attach some test results? Here is the code review, hopefully we are almost there:

* There is an NPE in {{RateBasedBackPressureTest}}; it looks like we need to call {{DatabaseDescriptor.daemonInitialization();}} (see the sketch after this list).
* {{DatabaseDescriptorRefTest}} has two failures; I think we need to add the new classes that get initialized by DD to the test white list.
* There are several failing dtests compared to trunk, and it seems we have lots of schema related issues. Can we rerun both jobs to see if this is a coincidence or if we have an issue? This sort of error appears on trunk as well, but usually it is very rare and only 1 or 2 tests have schema issues; here we had more than 10.
* Is [this commit|https://github.com/apache/cassandra/commit/ad729f2d1758ec8c4add7b71ce3c3a680f5beb6d] complete? It looks like {{AbstractWriteResponseHandler.start}} is not used.
* Trailing spaces (but they can be fixed on commit).
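A minimal sketch of the fix suggested in the first bullet, assuming a plain JUnit 4 test class. {{DatabaseDescriptor.daemonInitialization()}} is the call named in the review; the test body and the {{getWriteRpcTimeout()}} check are only illustrative and may not match the real test.

{code:java}
import org.junit.BeforeClass;
import org.junit.Test;

import org.apache.cassandra.config.DatabaseDescriptor;

import static org.junit.Assert.assertTrue;

public class RateBasedBackPressureTest
{
    @BeforeClass
    public static void setupDD()
    {
        // Initializes DatabaseDescriptor for unit tests that touch configuration,
        // avoiding NPEs when config-dependent code paths are exercised.
        DatabaseDescriptor.daemonInitialization();
    }

    @Test
    public void configIsInitialized()
    {
        // After daemonInitialization() yaml-backed settings are readable (illustrative check only).
        assertTrue(DatabaseDescriptor.getWriteRpcTimeout() > 0);
    }
}
{code}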
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487836#comment-15487836 ]

Eduard Tudenhoefner commented on CASSANDRA-9318:

[~sbtourist] nothing to add from my side. Stuff looks really good.
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487712#comment-15487712 ]

Russ Hatch commented on CASSANDRA-9318:

Nothing extra to add here from me.
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487229#comment-15487229 ]

Sergio Bossa commented on CASSANDRA-9318:

Testing should now be complete and all the main points have been verified:

1) No performance regressions with back-pressure disabled.
2) No performance regressions with back-pressure enabled and a normally behaving cluster.
3) Reduced dropped mutations, typically by an order of magnitude, for continuous heavy write workloads, and full recovery via hints: tested with a 4-node RF=3 cluster, ByteMan-based rate limiting applied in several scenarios to one or more nodes, and "slow" back-pressure.

[~eduard.tudenhoefner], [~rhatch], do you have anything to add?

I've also fixed a bug along the way and rebased to the latest trunk: [~Stefania], would you mind having a final (hopefully!) review?

Updated patch and test runs:

| [trunk patch|https://github.com/apache/cassandra/compare/trunk...sbtourist:CASSANDRA-9318-trunk?expand=1] | [testall|https://cassci.datastax.com/view/Dev/view/sbtourist/job/sbtourist-CASSANDRA-9318-trunk-testall/] | [dtest|https://cassci.datastax.com/view/Dev/view/sbtourist/job/sbtourist-CASSANDRA-9318-trunk-dtest/] | [dtest (back-pressure enabled)|https://cassci.datastax.com/view/Dev/view/sbtourist/job/sbtourist-CASSANDRA-9318-bp-true-trunk-dtest/] |
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403537#comment-15403537 ]

Stefania commented on CASSANDRA-9318:

It looks good [~sbtourist], I only noticed a typo in cassandra.yaml at this [line|https://github.com/apache/cassandra/compare/trunk...sbtourist:CASSANDRA-9318-trunk?expand=1#diff-bdaab1104a93e723ce0b609a6477c9c4R1161]: "implement provide". I did not re-review the newly added files, since they should not have changed; only the existing files with modifications.
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402137#comment-15402137 ]

Sergio Bossa commented on CASSANDRA-9318:

Patch rebased to trunk, new links:

| [trunk patch|https://github.com/apache/cassandra/compare/trunk...sbtourist:CASSANDRA-9318-trunk?expand=1] | [testall|https://cassci.datastax.com/view/Dev/view/sbtourist/job/sbtourist-CASSANDRA-9318-trunk-testall/] | [dtest|https://cassci.datastax.com/view/Dev/view/sbtourist/job/sbtourist-CASSANDRA-9318-trunk-dtest/] | [dtest (back-pressure enabled)|https://cassci.datastax.com/view/Dev/view/sbtourist/job/sbtourist-CASSANDRA-9318-bp-true-trunk-dtest/] |

[~Stefania], feel free to have another quick look at the new patch (no relevant changes, by the way).

[~eduard.tudenhoefner], in terms of testing, I've already done some on my own as reported in one of my earlier comments, but I am sure you'll have more to add; the objectives of our testing should be:

1) Verify there's no performance impact *at all* when back-pressure is disabled.
2) Verify there's no relevant performance impact when back-pressure is enabled and the cluster is well-behaving.
3) Verify back-pressure works correctly when the cluster is overloaded, and dropped mutations are reduced (how much they are reduced depends on the actual back-pressure configuration, and I hope the {{cassandra.yaml}} comments will help with that; otherwise let me know what's unclear and I'll improve them).

All tests should be run with at least 4 nodes and RF=3; the attached ByteMan rule can be used to simulate "slow replica" scenarios and kick in back-pressure.
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398930#comment-15398930 ]

Stefania commented on CASSANDRA-9318:

It's in pretty good shape now! :) Let's rebase and start the testing.
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397353#comment-15397353 ]

Sergio Bossa commented on CASSANDRA-9318:

[~Stefania],

bq. HintsDispatcher.Callback goes through the other one. Also, it is not consistent with the fact that supportsBackPressure() is defined at the callback level.

Fair enough, {{supportsBackPressure()}} will do its job in excluding spurious calls anyway.

bq. But the time windows of replicas do not necessarily match, they might be staggered.

The algorithm was meant to work at steady state, assuming even key distribution, but you made me realize it was nonetheless exposed to subtle fluctuations, so thanks for the excellent point :) It should now be fixed in the latest round of commits.

I'm pretty happy with this latest version, so if you're happy too, I'll move forward with a last self-review round and then rebase onto trunk.
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389246#comment-15389246 ]

Stefania commented on CASSANDRA-9318:

bq. Nope, you can actually implement such kind of strategy after my latest changes to the state interface: you just have to keep recording per-replica state, and then eventually compute a global coordinator state at back-pressure application.

Except the state may have nothing to do with a replica, but let's leave it for now; I'm fine with this.

bq. I ignored it on purpose as that's supposed to be for non-mutations.

HintsDispatcher.Callback goes through the other one. Also, it is not consistent with the fact that supportsBackPressure() is defined at the callback level.

bq. This can't actually happen. Given the same replica group (token range), the replicas are sorted in stable rate limiting order, which means the same fast/slow replica will be picked each time for a given back-pressure window; obviously, the replica order might change at the next window, but that's by design, and then all mutations will converge again to the same replica: makes sense?

But the time windows of replicas do not necessarily match, they might be staggered. For example, replicas 1, 2 and 3 acquire a window for mutation A at the same time, then replicas 2, 3 and 4 are used for mutation B: in this case only 3 and 4 will acquire the window, because 2 acquired it earlier, so the order may change even during a window. Or does it work by approximation, assuming we send mutations to all replicas frequently enough?
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389176#comment-15389176 ]

Sergio Bossa commented on CASSANDRA-9318:

[~Stefania],

bq. OK, so we are basically saying other coordinator-based approaches won't be implemented as a back-pressure strategy.

Nope, you can actually implement such kind of strategy after my latest changes to the state interface: you just have to keep recording per-replica state, and then eventually compute a global coordinator state at back-pressure application.

bq. There's another overload of sendRR

I ignored it on purpose as that's supposed to be for non-mutations.

bq. One final thought, now we have N rate limiters, and we call acquire(1) on only one of them. For the next mutation we may pick another one and so forth.

This can't actually happen. Given the same replica group (token range), the replicas are sorted in stable rate limiting order, which means the same fast/slow replica will be picked each time for a given back-pressure window; obviously, the replica order might change at the next window, but that's by design, and then all mutations will converge again to the same replica: makes sense?
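A simplified sketch of the selection being discussed, with invented types ({{ReplicaState}}, {{Flow}}) standing in for the real per-replica back-pressure state: SLOW throttles at the slowest replica's observed rate, FAST at the fastest, and only the selected replica's limiter is acquired for a given replica group.

{code:java}
import java.util.Comparator;
import java.util.List;

import com.google.common.util.concurrent.RateLimiter;

// Invented flow selector: SLOW paces the coordinator at the slowest replica, FAST at the fastest.
enum Flow { SLOW, FAST }

// Invented per-replica snapshot: an observed acknowledgement rate and a limiter sized from it.
final class ReplicaState
{
    final String replica;
    final double rate;            // acknowledged mutations per second over the last window
    final RateLimiter limiter;

    ReplicaState(String replica, double rate)
    {
        this.replica = replica;
        this.rate = rate;
        this.limiter = RateLimiter.create(rate);
    }
}

final class FlowSelectionSketch
{
    /** Picks the replica whose limiter paces the whole group, depending on the flow. */
    static ReplicaState select(List<ReplicaState> group, Flow flow)
    {
        Comparator<ReplicaState> byRate = Comparator.comparingDouble(s -> s.rate);
        return flow == Flow.SLOW
               ? group.stream().min(byRate).orElseThrow(IllegalStateException::new)
               : group.stream().max(byRate).orElseThrow(IllegalStateException::new);
    }

    static void throttle(List<ReplicaState> group, Flow flow)
    {
        // Only the selected replica's limiter is acquired; within a back-pressure window the
        // same replica keeps being selected, so permits accumulate on a single limiter.
        select(group, flow).limiter.acquire(1);
    }
}
{code}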
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15388813#comment-15388813 ]

Stefania commented on CASSANDRA-9318:

bq. any strategy that doesn't take into account per-node state is IMHO not a proper back-pressure strategy, and can be implemented in the same way as hints overflowing is implemented (that is, outside the back-pressure APIs).

OK, so we are basically saying other coordinator-based approaches won't be implemented as a back-pressure strategy. Provided [~jbellis] and [~slebresne] are fine with this, then that's OK with me.

bq. Anyway, I've made the BackPressureState interface more generic so it can react on messages when they're about to be sent, which should allow to implement different strategies as you mentioned. This should also make the current RateBased implementation clearer (see below), which is nice

I like what you did here, thanks.

bq. Apologies, the SP method went through several rewrites in our back and forth and I didn't pay enough attention during the last one, totally stupid mistake. Should be fixed now, but I've kept the back-pressure application on top, as in case the strategy implementation wants to terminate the request straight away (i.e. by throwing an exception) it doesn't make sense to partially send messages (that is, either all or nothing is IMHO better).

It looks good now. Agreed on keeping back-pressure on top.

bq. The RateBased implementation doesn't "count" outgoing requests when they're sent, which would cause the bug you mentioned, ...

I totally forgot this yesterday when I was reasoning about non-local replicas, apologies.

bq. This can only happen when a new replica is injected into the system, or if it comes back alive after being dead, otherwise we always send mutations to all replicas, or am I missing something? There's no lock-free way to remove such replicas, because "0 outgoing" is a very transient state, and that's actually the point: the algorithm is designed to work at steady state, and after a single "window" (rpc timeout seconds) the replica state will have received enough updates.

Yes, you are correct: we always receive replies, even from non-local replicas. I was simply suggesting excluding replicas with a positive-infinity rate from the sorted states used to calculate the back-pressure, but that's no longer relevant.

bq. Please have a look at my changes, I will do one more review pass myself as I think some aspects of the rate-based algorithm can be improved, then if we're all happy I will go ahead with rebasing to trunk and then moving into testing.

+1 on starting the tests, just two more easy things:

* Something went wrong with the IDE imports at this line [here|https://github.com/apache/cassandra/commit/3041d58184a4ee63a04204d3c6228cbaa1c4fd05#diff-0cbd76290040064cf257d855fb5dd779R42]; there's a bunch of repeated unused imports that need to be removed.
* There's another overload of {{sendRR}} [just above|https://github.com/apache/cassandra/commit/3041d58184a4ee63a04204d3c6228cbaa1c4fd05#diff-af09288f448c37a525e831ee90ea49f9R766]; I think technically we should cover this too, even though mutations go through the other one.

One final thought: now we have N rate limiters, and we call acquire(1) on only one of them. For the next mutation we may pick another one, and so forth. So the rate limiters of the individual replicas have not 'registered' the actual permits (one per message) and may not throttle as effectively. Have you considered this at all?
> Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15388282#comment-15388282 ] Sergio Bossa commented on CASSANDRA-9318: - [~Stefania], bq. We just need to decide if we are happy having a back pressure state for each OutboundTcpConnectionPool, even though back pressure may be disabled or some strategies may use different metrics, e.g. total in-flight requests or memory, or whether the states should be hidden inside the strategy. As I said, if you move the states inside the strategy, you make the strategy "fattier", and I don't like that. I think keeping the states associated to the connections (as a surrogate for the actual nodes) makes sense, because that's what back-pressure is about; any strategy that doesn't take into account per-node state, is IMHO not a proper back-pressure strategy, and can be implemented in the same way as hints overflowing is implemented (that is, outside the back-pressure APIs). Anyway, I've made the {{BackPressureState}} interface more generic so it can react on messages when they're about to be sent, which should allow to implement different strategies as you mentioned. This should also make the current {{RateBased}} implementation clearer (see below), which is nice :) bq. Now the back pressure is applied before hinting, or inserting local mutations, or sending to remote data centers, but after sending to local replicas. I think we may need to rework SP.sendToHintedEndpoints a little bit, I think we want to fire the insert local, then block, then send all messages. Apologies, the {{SP}} method went through several rewrites in our back and forth and I didn't pay enough attention during the last one, totally stupid mistake. Should be fixed now, but I've kept the back-pressure application on top, as in case the strategy implementation wants to terminate the request straight away (i.e. by throwing an exception) it doesn't make sense to partially send messages (that is, either all or nothing is IMHO better). bq. There is one more potential issue in the case of non local replicas. We send the mutation only to one remote replica, which then forwards it to other replicas in that DC. So remote replicas may have the outgoing rate set to zero, and therefore the limiter rate set to positive infinity, which means we won't throttle at all with FAST flow selected. The {{RateBased}} implementation doesn't "count" outgoing requests when they're sent, which would cause the bug you mentioned, but when the response is received or the callback is expired, in order to guarantee a consistent counting between outgoing and incoming messages, i.e. if outgoing messages are more than incoming ones we know that's because of timeouts, not because some requests are still in-flight; this makes the algorithm way more stable, and also overcomes the problem you mentioned. The latest changes to the state interface should hopefully make this implementation detail clearer. bq. In fact, we have no way of reliably knowing the status of remote replicas, or replicas to which we haven't sent any mutations recently, and we should perhaps exclude them. This can only happen when a new replica is injected into the system, or if it comes back alive after being dead, otherwise we always send mutations to all replicas, or am I missing something? 
There's no lock-free way to remove such replicas, because "0 outgoing" is a very transient state, and that's actually the point: the algorithm is designed to work at steady state, and after a single "window" (rpc timeout seconds) the replica state will have received enough updates. bq. I think we can start testing once we have sorted the second discussion point above, as for the API issues, we'll eventually need to reach consensus but we don't need to block the tests for it. Please have a look at my changes, I will do one more review pass myself as I think some aspects of the rate-based algorithm can be improved, then if we're all happy I will go ahead with rebasing to trunk and then moving into testing. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it
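The counting scheme described above can be sketched as follows; the class and method names are assumptions for illustration and do not reproduce the actual {{RateBasedBackPressureState}}. Both counters are updated only when a response arrives or the callback expires, so at the end of each RPC-timeout window the incoming/outgoing ratio reflects timeouts rather than requests still in flight.
{code}
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only; names are assumptions, not the actual state class.
class ReplicaRateStateSketch
{
    private final AtomicLong incoming = new AtomicLong(); // answered within the window
    private final AtomicLong outgoing = new AtomicLong(); // outcome known: answered or expired

    void onResponseReceived()
    {
        // Counted only now, not at send time, so requests still in flight
        // never make the replica look slower than it actually is.
        outgoing.incrementAndGet();
        incoming.incrementAndGet();
    }

    void onCallbackExpired()
    {
        // A timeout counts as "sent but not answered".
        outgoing.incrementAndGet();
    }

    /** Ratio used to derive the back-pressure rate for this replica. */
    double responseRatio()
    {
        long sent = outgoing.get();
        return sent == 0 ? 1.0 : (double) incoming.get() / sent;
    }
}
{code}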
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386978#comment-15386978 ] Stefania commented on CASSANDRA-9318: - bq. I think if we add too many responsibilities to the strategy it really starts to look less and less as an actual strategy; we could rather add a "BackPressureManager", but to be honest, the current responsibility assignment feels natural to me, and back-pressure is naturally a "messaging layer" concern, so I'd keep everything as it is unless you can show any actual advantages in changing the design. I don't think we want a "BackPressureManager". We just need to decide if we are happy having a back pressure state for each OutboundTcpConnectionPool, even though back pressure may be disabled or some strategies may use different metrics, e.g. total in-flight requests or memory, or whether the states should be hidden inside the strategy. bq. This was intentional, so that outgoing requests could have been processed while pausing; but it had the drawback of making the rate limiting less "predictable" (as dependant on which thread was going to be called next), so I ended up changing it. Now the back pressure is applied before hinting, or inserting local mutations, or sending to remote data centers, but _after_ sending to local replicas. I think we may need to rework SP.sendToHintedEndpoints a little bit, I think we want to fire the insert local, then block, then send all messages. There is one more potential issue in the case of *non local replicas*. We send the mutation only to one remote replica, which then forwards it to other replicas in that DC. So remote replicas may have the outgoing rate set to zero, and therefore the limiter rate set to positive infinity, which means we won't throttle at all with FAST flow selected. In fact, we have no way of reliably knowing the status of remote replicas, or replicas to which we haven't sent any mutations recently, and we should perhaps exclude them. bq. It's not a matter of making the sort stable, but a matter of making the comparator fully consistent with equals, which should be already documented by the treeset/comparator javadoc. It's in the comparator javadoc but not in the TreeSet constructor doc; that's why I missed it yesterday. Thanks. bq. Agreed. If we all (/cc Aleksey Yeschenko Sylvain Lebresne Jonathan Ellis) agree on the core concepts of the current implementation, I will rebase to trunk, remove the trailing spaces, and then I would move into testing it, and in the meantime think/work on adding metrics and making any further adjustments. I think we can start testing once we have sorted the second discussion point above, as for the API issues, we'll eventually need to reach consensus but we don't need to block the tests for it. One more tiny nit: * The documentation of {{RateBasedBackPressureState}} still mentions the overloaded flag. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. 
> An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
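The {{SP.sendToHintedEndpoints}} rework suggested above (fire the local insert, then block, then send all messages) can be pictured with the following sketch; all names are made up for illustration and this is not the actual {{StorageProxy}} code:
{code}
import java.net.InetAddress;
import java.util.List;

// Illustrative ordering only; not the actual StorageProxy code.
class WritePathOrderingSketch
{
    interface BackPressure { void apply(List<InetAddress> replicas); } // may sleep or throw

    private final BackPressure backPressure;

    WritePathOrderingSketch(BackPressure backPressure)
    {
        this.backPressure = backPressure;
    }

    void write(String mutation, List<InetAddress> replicas, InetAddress self)
    {
        // Suggested ordering: fire the local insert first...
        if (replicas.contains(self))
            applyLocallyAsync(mutation);

        // ...then block on back-pressure...
        backPressure.apply(replicas);

        // ...then send to the remaining replicas. (The patch ultimately kept
        // back-pressure on top of the method instead, so that a strategy that
        // throws never leaves the mutation partially sent.)
        for (InetAddress replica : replicas)
            if (!replica.equals(self))
                send(mutation, replica);
    }

    private void applyLocallyAsync(String mutation) { /* stub for the sketch */ }
    private void send(String mutation, InetAddress replica) { /* stub for the sketch */ }
}
{code}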
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386265#comment-15386265 ] Sergio Bossa commented on CASSANDRA-9318: - [~Stefania], I've addressed most of your concerns in my latest two commits, also see my following comments: bq. If we removed MessagingService.getBackPressurePerHost(), which doesn't seem to be used, then we can at least remove getBackPressureRateLimit and getHost from BackPressureState, leaving it with only onMessageSent and onMessageReceived. All of those are used for JMX monitoring (which will be improved more) or logging, so I'd keep them. bq. We can also consider moving the states internally to the strategy I think if we add too many responsibilities to the strategy it really starts to look less and less as an actual strategy; we could rather add a "BackPressureManager", but to be honest, the current responsibility assignment feels natural to me, and back-pressure is naturally a "messaging layer" concern, so I'd keep everything as it is unless you can show any actual advantages in changing the design. bq. We are also applying backpressure after sending the messages, is this intentional? This _was_ intentional, so that outgoing requests could have been processed while pausing; but it had the drawback of making the rate limiting less "predictable" (as dependant on which thread was going to be called next), so I ended up changing it. bq. In the very same comparator lambda, if 2 states are not equal but have the exact same rate limit, then I assume it is intentional to compare them by hash code in order to get a stable sort? If so we should add a comment since tree set only requires the items to be mutually comparable and therefore returning zero would have been allowed as well I believe. It's not a matter of making the sort stable, but a matter of making the comparator fully consistent with {{equals}}, which should be already documented by the treeset/comparator javadoc. bq. One failure that needs to be looked at: Fixed! bq. Let's test the strategy first, then metrics can be added later on. Agreed. If we all (/cc [~iamaleksey] [~slebresne] [~jbellis]) agree on the core concepts of the current implementation, I will rebase to trunk, remove the trailing spaces, and then I would move into testing it, and in the meantime think/work on adding metrics and making any further adjustments. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
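The comparator point can be shown with a small self-contained example (hypothetical state class, not the patch code): if the comparator returns 0 for two distinct states with the same rate limit, a {{TreeSet}} treats them as duplicates and silently drops one, which is why the tie is broken on the hash code.
{code}
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical state class used only to illustrate the TreeSet/comparator point.
class StateSketch
{
    final String host;
    final double rateLimit;

    StateSketch(String host, double rateLimit) { this.host = host; this.rateLimit = rateLimit; }

    public static void main(String[] args)
    {
        // Comparator returning 0 for equal rates: the TreeSet keeps only one
        // of two distinct states with the same rate.
        Comparator<StateSketch> byRateOnly = Comparator.comparingDouble(s -> s.rateLimit);

        // Comparator made consistent with equals: ties on rate are broken by
        // identity hash code, so distinct instances never compare as 0.
        Comparator<StateSketch> byRateThenIdentity = byRateOnly.thenComparingInt(System::identityHashCode);

        TreeSet<StateSketch> lossy = new TreeSet<>(byRateOnly);
        TreeSet<StateSketch> safe = new TreeSet<>(byRateThenIdentity);
        StateSketch a = new StateSketch("10.0.0.1", 100.0);
        StateSketch b = new StateSketch("10.0.0.2", 100.0);
        lossy.add(a); lossy.add(b);
        safe.add(a); safe.add(b);
        System.out.println(lossy.size() + " vs " + safe.size()); // prints "1 vs 2"
    }
}
{code}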
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385280#comment-15385280 ] Stefania commented on CASSANDRA-9318: - The current implementation is still not that generic: * If we removed {{MessagingService.getBackPressurePerHost()}}, which doesn't seem to be used, then we can at least remove {{getBackPressureRateLimit}} and {{getHost}} from {{BackPressureState}}, leaving it with only {{onMessageSent}} and {{onMessageReceived}}. * We can also consider moving the states internally to the strategy, at the cost of an O(1) lookup, or at least encapsulate {{MessagingService.updateBackPressureState}} so that the states become opaque. Eventually we could have an update type (message received, memory threshold reached, etc.) so that strategies only deal with the type of updates they require. * The apply method doesn't leave much room in terms of what to do: we can basically sleep after a set of replicas has been selected by SP, or throw an exception. We are also applying backpressure _after sending the messages_, is this intentional? Some other nits: * In RateBasedBackPressure some fields can have a package based access rather than public/protected and the static in the flow enum declaration is redundant. * We may want to add the flow to the logger.info message "Initialized back-pressure..." in the constructor of RateBasedBackPressure. * In {{RateBasedBackPressure.apply(Iterable states)}}, casting the lambda to {{Comparator}} is unnecessary. * In the very same comparator lambda, if 2 states are not equal but have the exact same rate limit, then I assume it is intentional to compare them by hash code in order to get a stable sort? If so we should add a comment since tree set only requires the items to be mutually comparable and therefore returning zero would have been allowed as well I believe. * {{RateBasedBackPressureState}} can be package local, {{windowSize}} can be private and {{getLastAcquire()}} can be package local. * {{sortOrder}} in {{MockBackPressureState}} is not used. * Several trailing spaces. One [failure|https://cassci.datastax.com/view/Dev/view/sbtourist/job/sbtourist-CASSANDRA-9318-3.0-dtest/lastCompletedBuild/testReport/cqlsh_tests.cqlsh_copy_tests/CqlshCopyTest/test_bulk_round_trip_with_timeouts_2/] that needs to be looked at:
{code}
ERROR [SharedPool-Worker-1] 2016-07-19 12:40:38,257 QueryMessage.java:128 - Unexpected error during query
java.lang.IllegalArgumentException: Size should be greater than precision.
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:122) ~[guava-18.0.jar:na]
	at org.apache.cassandra.utils.SlidingTimeRate.<init>(SlidingTimeRate.java:51) ~[main/:na]
	at org.apache.cassandra.net.RateBasedBackPressureState.<init>(RateBasedBackPressureState.java:65) ~[main/:na]
	at org.apache.cassandra.net.RateBasedBackPressure.newState(RateBasedBackPressure.java:205) ~[main/:na]
	at org.apache.cassandra.net.RateBasedBackPressure.newState(RateBasedBackPressure.java:43) ~[main/:na]
	at org.apache.cassandra.net.MessagingService.getConnectionPool(MessagingService.java:649) ~[main/:na]
	at org.apache.cassandra.net.MessagingService.getConnection(MessagingService.java:663) ~[main/:na]
	at org.apache.cassandra.net.MessagingService.sendOneWay(MessagingService.java:812) ~[main/:na]
	at org.apache.cassandra.net.MessagingService.sendRR(MessagingService.java:755) ~[main/:na]
	at org.apache.cassandra.net.MessagingService.sendRR(MessagingService.java:733) ~[main/:na]
	at org.apache.cassandra.service.StorageProxy.truncateBlocking(StorageProxy.java:2380) ~[main/:na]
	at org.apache.cassandra.cql3.statements.TruncateStatement.execute(TruncateStatement.java:71) ~[main/:na]
	at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:206) ~[main/:na]
{code}
bq. Metrics can be added in this ticket once we settle on the core algorithm, but then yes any reporting mechanism to clients should be probably dealt with separately as it would probably involve changes to the native protocol (and I'm not sure what's the usual procedure in such case). Let's test the strategy first, then metrics can be added later on. Let's not worry about native protocol changes for now, since we are no longer throwing exceptions. bq. Regarding the memory-threshold strategy, I would like to stress once again the fact it's a coordinator-only back-pressure mechanism which wouldn't directly benefit replicas; also, Ariel Weisberg showed in his tests that such strategy isn't even needed given the local hints back-pressure/overflowing mechanisms already implemented. So my vote goes against it, but if you/others really want it, I'm perfectly fine implementing it, but it will have to be in this ticket, as it requires some more changes to the back-pressure API.
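The failure above is a constructor precondition being violated; a minimal reproduction of that kind of guard, assuming Guava's {{Preconditions}} (parameter names and units are assumptions, not the actual {{SlidingTimeRate}} signature):
{code}
import com.google.common.base.Preconditions;
import java.util.concurrent.TimeUnit;

// Minimal reproduction of the kind of guard that produced the dtest failure above;
// names are assumptions, not the actual SlidingTimeRate constructor.
class SlidingRateGuardSketch
{
    private final long sizeMillis;
    private final long precisionMillis;

    SlidingRateGuardSketch(long size, long precision, TimeUnit unit)
    {
        this.sizeMillis = unit.toMillis(size);
        this.precisionMillis = unit.toMillis(precision);
        // This is the kind of argument check that threw in the dtest: the
        // window size must be strictly greater than its precision.
        Preconditions.checkArgument(sizeMillis > precisionMillis,
                                    "Size should be greater than precision.");
    }
}
{code}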
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15382114#comment-15382114 ] Sergio Bossa commented on CASSANDRA-9318: - I've pushed a new update based on the most recent discussions, more specifically: * I've implemented (and locally tested) rate limiting based on either the fastest or slowest replica; rate limiting on the average rate required computing such rate for all replicas in the whole cluster each RPC timeout interval (2 secs by default), which I deemed to be not worth it. * I've removed the low ratio configuration: if we aim to keep the cluster as "responsive" as possible, throwing exceptions or prematurely hinting doesn't make sense. * {{SP.sendToHintedEndpoints}} is largely unchanged: given we only rate limit by a single (fastest/slowest) replica, sorting the endpoints by back-pressure doesn't make sense either. Answering to [~Stefania]'s latest comments: bq. The proposal doesn't address: "The API should provide for reporting load to clients so they can do real load balancing across coordinators and not just round-robin." This is misleading: the implemented back-pressure mechanism is cross-replica, so load balancing to different coordinators would be useless. bq. we can add metrics or another mechanism later on in follow up tickets, as well as consider implementing the memory-threshold strategy. Metrics can be added in this ticket once we settle on the core algorithm, but then yes any reporting mechanism to clients should be probably dealt with separately as it would probably involve changes to the native protocol (and I'm not sure what's the usual procedure in such case). Regarding the memory-threshold strategy, I would like to stress once again the fact it's a coordinator-only back-pressure mechanism which wouldn't directly benefit replicas; also, [~aweisberg] showed in his tests that such strategy isn't even needed given the local hints back-pressure/overflowing mechanisms already implemented. So my vote goes against it, but if you/others really want it, I'm perfectly fine implementing it, but it will have to be in this ticket, as it requires some more changes to the back-pressure API. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
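The fastest/slowest selection mentioned above can be pictured with this sketch; the enum and method names are assumptions, not the actual {{RateBasedBackPressure}} code:
{code}
import java.util.Collections;
import java.util.List;

// Illustrative only; names are assumptions for the sketch.
class FlowSelectionSketch
{
    enum Flow { FAST, SLOW }

    /**
     * Given the per-replica rates (requests/sec) for one write, pick the
     * single rate the coordinator will throttle at.
     */
    static double selectRate(List<Double> replicaRates, Flow flow)
    {
        if (replicaRates.isEmpty())
            return Double.POSITIVE_INFINITY; // nothing to throttle on

        return flow == Flow.FAST
             ? Collections.max(replicaRates)   // least restrictive: fastest replica
             : Collections.min(replicaRates);  // most restrictive: slowest replica
        // An AVERAGE flow was considered but dropped, as computing it for all
        // replicas every RPC-timeout window was deemed not worth the cost.
    }
}
{code}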
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15381623#comment-15381623 ] Stefania commented on CASSANDRA-9318: - I'm happy with the suggested way forward. It removes exceptions and it makes the rate based strategy more tunable rather than forcing everything to the slowest replica rate. The proposal doesn't address: bq. The API should provide for reporting load to clients so they can do real load balancing across coordinators and not just round-robin. However, we can add metrics or another mechanism later on in follow up tickets, as well as consider implementing the memory-threshold strategy. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379014#comment-15379014 ] Sergio Bossa commented on CASSANDRA-9318: - Based on everyone's feedback above, I propose to move forward with these changes: * Break the {{BackPressureStrategy}} apply method into two, one to compute the back-pressure state and return it sorted in ascending order (from lower back-pressure to higher/overloader), the other to actually apply it. * Apply back-pressure on all nodes/states together, and have {{RateBasedBackPressure}} apply it based on the faster, slower, or average rate, depending on configuration: this is to "soften" [~jbellis]' concern about always limiting at the slowest rate. * Rework {{SP.sendToHintedEndpoints}} a little bit to send mutations in the order returned by the {{BackPressureStrategy}}: this is a minor optimization based on the reasoning that "slower replicas" might be also slower to accept writes at the TCP level, and I can skip it if people prefer not to touch SP too much. * Also have {{SP.sendToHintedEndpoints}} directly hint overloaded replicas rather than back-pressure them or throw exception; again, this is an optimization to avoid wasting time with dealing with too slow replicas. Regarding the memory-threshold strategy, I'll hold on implementing it until we consolidate the points above. Regarding any changes related to the write request path going fully non-blocking, code placement will probably change but all such core concepts will stay the same, and I'll be happy to work on any changes needed (CASSANDRA-8457 is another one to keep an eye on /cc [~jasobrown]). > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378784#comment-15378784 ] Stefania commented on CASSANDRA-9318: - I don't have much to add regarding the merits of one approach vs. the other, other than to say that I agree with [~slebresne] that we should implement the strategy API so that both strategies can be supported, and this will make it more likely that the API is fit for even more strategies. I would even go one step further and suggest that the second strategy should be relatively easy to implement if the framework is in place and if we can work out a reasonable threshold. Therefore we could consider implementing and testing both, either as part of a follow up ticket or this one. Another thing I would like to point out is that, once we make read and write requests fully non-blocking, either via CASSANDRA-10993 or CASSANDRA-10528, we will probably have to rethink this. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15377074#comment-15377074 ] Ariel Weisberg commented on CASSANDRA-9318: --- I incorrectly thought request threads block until all responses are received. They only block until enough responses to satisfy the consistency level are received. So running out of request threads is not always going to be an issue because you have enough slow nodes involved in the request. So rate limiting will work in that it can artificially increase your CL (not really but still) to reduce throughput and avoid timeouts. This will also have the effect of preventing the coordinator from using more memory because request threads can't meet their CL and move on to process new requests. Instead they will block in a rate limiter. My measurements showed that you can provision enough memory to let the timeout kick in so I am not sure that is a useful behavior. Sure it eliminates timeouts, but if that is the end goal maybe we need a consistency level that is something like CL.ALL but tolerates unavailable nodes. That would have the same effect without rate limiting. It's still a partial solution because you can't write at full speed to ranges that don't contain slow nodes. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15376955#comment-15376955 ] Ariel Weisberg commented on CASSANDRA-9318: --- There really isn't much memory to play with when deciding when to backpressure. There are 128 request threads and once those are all consumed by a slow node, which doesn't take long in a small cluster, things stall completely. If things were async you might be able to commit enough memory that requests time out before you need to stall. In other words you can shed via timeouts to nodes and no additional mechanisms are needed. Not reading from clients doesn't address the issue. You have still created a situation in which nodes that are performing well can't make progress because they can no longer read requests from clients because of one slow node. Not reading from clients is the current implementation. Hinting as it works now doesn't address the issue because the slow node may never actually catch up or become faster. Waiting for every request that is going to time out to time out and be hinted is going to restrict the coordinator's ability to coordinate. Hinting also doesn't work because there are only 128 concurrent requests that can be in the process of being hinted; see paragraph #1. If the coordinator wants to continue to make progress it has to read requests from clients and then quickly know if it should shed them. We could shed them silently in which case the upstream client is going to time out and it's going to exhaust its memory or threadpool and we have silently and unfixably moved the problem upstream. I suppose clients can try and implement their own health metrics to duplicate the work we are doing at the coordinator, but they still can't force the coordinator to shed so the client can replace those requests that won't succeed with ones that will. Or we can signal that we aren't going to do that request at this time and the client can engage whatever mitigation strategy it wants to implement. There is a whole separate discussion about what the state of the art needs to be in client drivers to do something useful with this information and how to expose the mechanism and policy choices to applications. Rate limiting isn't really useful. You just end up with all the request threads stuck in the rate limiter and coordinators continue to not make progress. Rate limiting doesn't solve a load issue at the remote end because as I've demonstrated the remote end can buffer up enough requests until shedding kicks in due to the timeout and reduces memory utilization to something the heap can handle. If things were async what would rate limiting look like? Would it be disabling read for clients? How is the coordinator going to make progress then if it can't coordinate requests for healthy nodes? > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. 
> An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15376629#comment-15376629 ] Sergio Bossa commented on CASSANDRA-9318: - [~jbellis], yes I think we're going in circles, and probably either one or the other is missing each other point. You said: bq. Pick a number for how much memory we can afford to have taken up by in flight requests (remembering that we need to keep the entirely payload around for potential hint writing) as a fraction of the heap, the way we do with memtables or key cache. If we hit that mark we start throttling new requests and only accept them as old ones drain off. I'll try a last attempt at explaining why _in my opinion_ this is worse _in all accounts_ via an example. Say you have: * U: the memory unit to express back-pressure. * T: the write RPC timeout. * 3 nodes cluster with RF=2. * CL.ONE. Say the coordinator memory threshold is at 10U, and clients hit it with requests worth 20Us, and for simplicity the coordinator is not a replica, which means it has to accommodate 40U of inflight requests. At some point P < T, the first replica answers all of them, while the second replica answers only half, which means the memory threshold is met at 10U requests, which will be drained when either the replica answers (let's call this time interval R) or T elapses. This means: 1) During a time period equal to min(R,T) you'll have 0 throughput. This is made worse if you actually have to wait for the T to elapse, as it gets proportional to T. 2) If replica 2 keeps exhibiting the same slow behaviour, you'll keep having a throughput profile of high peaks and 0 valleys; this is bad not just because of the 0 valleys, but also because during the high peaks the slow replica will end up dropping 10U worth of mutations. Both #1 and #2 look pretty bad outcomes to me and definitely no better than throttling at the speed of the slowest replica. Speaking of which, how would the other algorithm work in such case? Given the same scenario above, we'd have the following: 1) During the first T period (T1), there will be no back-pressure (this is because the algorithm works at time windows of size T), so nothing changes from current behaviour. 2) At T2, it will start throttling at the slow replica rate of 10U (obviously the actual throttling is not memory based in this case but I'll keep the same notation for simplicity): this means that the sustained throughput during T2 will be 10U. 3) For all Tn > T2, throughput will *gradually* increase and eventually reduce, but without ever touching 0, which means _no high peaks causing high dropped mutations rate_, _no 0 valleys of length T_. Hope this clarifies my reasoning, and please let me know if/how my example above is flawed (that is, if I keep missing your point). > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. 
> An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15376513#comment-15376513 ] Avi Kivity commented on CASSANDRA-9318: --- FWIW this is exactly what we do. It's not perfect but it seems to work. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375853#comment-15375853 ] Jonathan Ellis commented on CASSANDRA-9318: --- bq. I can't see any better options than what we implement in this patch for those use cases willing to trade performance for overall stability I feel like we're going in circles here. Here is a better option: Pick a number for how much memory we can afford to have taken up by in-flight requests (remembering that we need to keep the entire payload around for potential hint writing) as a fraction of the heap, the way we do with memtables or key cache. If we hit that mark we start throttling new requests and only accept them as old ones drain off. This has the following benefits: # Strictly better than the status quo. I.e., does not make things worse where the current behavior is fine (single replica misbehaves, we write hints but don't slow things down), and makes things better where the current behavior is not (throttle instead of falling over OOM). # No client-side logic is required; all the client sees is slower request acceptance when throttling kicks in. # Gives us a metric we can expose to clients to improve load balancing. # Does not require a lot of tuning. (If the system is overloaded it will eventually reach even a relatively high mark. If it doesn't, well, you're not going to OOM so you don't need to throttle.) > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
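The high-watermark option described above could look roughly like the following minimal sketch; the names, thresholds and byte-based accounting are assumptions for illustration, not a proposed patch:
{code}
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of a byte-based in-flight watermark; names and thresholds are illustrative.
class InFlightWatermarkSketch
{
    private final long highWatermarkBytes;
    private final long lowWatermarkBytes;
    private final AtomicLong inFlightBytes = new AtomicLong();
    private volatile boolean accepting = true;

    InFlightWatermarkSketch(long highWatermarkBytes, long lowWatermarkBytes)
    {
        this.highWatermarkBytes = highWatermarkBytes;
        this.lowWatermarkBytes = lowWatermarkBytes;
    }

    /** Called when a request is admitted; returns false if we should stop accepting new ones. */
    boolean onRequestAccepted(long payloadBytes)
    {
        if (inFlightBytes.addAndGet(payloadBytes) >= highWatermarkBytes)
            accepting = false; // stop accepting until we drain below the low mark
        return accepting;
    }

    /** Called when a request completes, times out, or is hinted. */
    void onRequestDrained(long payloadBytes)
    {
        if (inFlightBytes.addAndGet(-payloadBytes) <= lowWatermarkBytes)
            accepting = true;
    }

    boolean isAccepting()
    {
        return accepting;
    }
}
{code}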
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375752#comment-15375752 ] Jonathan Ellis commented on CASSANDRA-9318: --- bq. if we make the strategy a bit more generic as mentioned above so the decision is made from all replica involved (maybe the strategy should also keep track of the replica-state completely internally so we can implement basic strategy like having a simple high watermark very easy), and we make sure to not throttle too quickly (typically, if a single replica is slow and we don't really need it, start by just hinting him), then I'd be happy moving to the "actually test this" phase and see how it goes. I suppose that's reasonable in principle, with some caveats: # Throwing exceptions shouldn't be part of the API. OverloadedException dates from the Thrift days, where our flow control options were very limited and this was the best we could do to tell clients, "back off." Now that we have our own protocol and full control over Netty we should simply not read more requests until we shed some load. (Since shedding load is a gradual process--requests time out, we write hints, our load goes down--clients will just perceive this as slowing down, which is what we want.) # The API should provide for reporting load to clients so they can do real load balancing across coordinators and not just round-robin. # Throttling requests to the speed of the slowest replica is not something we should ship, even as an option. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
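Point 1 above ("simply not read more requests until we shed some load") maps naturally onto Netty's per-channel auto-read flag; a minimal sketch, not the actual {{org.apache.cassandra.transport}} server code:
{code}
import io.netty.channel.Channel;

// Sketch only: how back-pressure could pause reads on a client connection via
// Netty instead of throwing OverloadedException. Not the actual transport code.
class ClientReadThrottleSketch
{
    /** Stop reading new requests from this client while we are overloaded. */
    static void pauseReads(Channel clientChannel)
    {
        clientChannel.config().setAutoRead(false);
    }

    /** Resume once in-flight load has drained below the low watermark. */
    static void resumeReads(Channel clientChannel)
    {
        clientChannel.config().setAutoRead(true);
    }
}
{code}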
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375191#comment-15375191 ] Sergio Bossa commented on CASSANDRA-9318: - [~slebresne], bq. if we make the strategy a bit more generic as mentioned above so the decision is made from all replica involved (maybe the strategy should also keep track of the replica-state completely internally so we can implement basic strategy like having a simple high watermark very easy) I'm already separating the "computation" phase from the "application", in order to address some previous points from my discussion with [~Stefania], so I think that should do it. bq. and we make sure to not throttle too quickly (typically, if a single replica is slow and we don't really need it, start by just hinting him) This can be easily implemented in the strategy itself as a parameter, i.e. "back-pressure cycles before actually starting to rate limit", so I'll do this eventually later. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375176#comment-15375176 ] Sylvain Lebresne commented on CASSANDRA-9318: - bq. At this point instead of adding more complexity to an approach that fundamentally doesn't solve that, why not back up and use an approach that does the right thing in all 3 cases instead? My understanding of the fundamentals of Sergio's approach is to: # maintain, on the coordinator, a state for each node that keep track of how much in-flight query we have for that node. # on a new write query, check the state for the replicas involved in that query to decide what to do (when to hint the node, when to start rate limiting or when to start rejecting the queries to the client). In that sense, I don't think the approach is fundamentally wrong but I feel the main question is on the "what to do (and when)". And as I'm not sure there is a single perfect answer for that, I do also like the approach of a strategy, if only so experimentation is easier (though technically, instead of just having an {{apply()}} that potentially throws or sleep, I think the strategy should take the replica for the query, and return a list of nodes to query and one to hint (preserving the ability to sleep or throw) to get more options on the "what to do", and not making backpressure a node-per-node thing). In term of the "default" back-pressure strategy we provide, I agree that we should mostly try to solve the scenario 3: we should define some condition where we consider things overloaded and only apply back-pressure from there. Not sure what that exact condition is btw, but I'm not convinced we can come with a good one out of thin air, I think we need to experiment. tl;dr, if we make the strategy a bit more generic as mentioned above so the decision is made from all replica involved (maybe the strategy should also keep track of the replica-state completely internally so we can implement basic strategy like having a simple high watermark very easy), and we make sure to not throttle too quickly (typically, if a single replica is slow and we don't really need it, start by just hinting him), then I'd be happy moving to the "actually test this" phase and see how it goes. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
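The more generic strategy shape suggested here (take the replicas for the query, return the nodes to query and the ones to hint, while preserving the ability to sleep or throw) could be expressed roughly as below; this is only a sketch of the suggestion, with assumed names, not the API that was eventually committed:
{code}
import java.net.InetAddress;
import java.util.List;

// Sketch of the interface shape suggested in the comment above; names are
// illustrative and do not match the committed BackPressureStrategy API.
interface WriteAdmissionStrategySketch
{
    /** Outcome of a back-pressure decision for one write. */
    final class Decision
    {
        final List<InetAddress> toQuery; // replicas to send the mutation to
        final List<InetAddress> toHint;  // replicas to hint instead of querying

        Decision(List<InetAddress> toQuery, List<InetAddress> toHint)
        {
            this.toQuery = toQuery;
            this.toHint = toHint;
        }
    }

    /**
     * Decide what to do with one write given all replicas involved.
     * Implementations may also sleep (rate limit) or throw an unchecked
     * exception (reject) before returning.
     */
    Decision admit(List<InetAddress> replicas);
}
{code}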
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375132#comment-15375132 ] Sergio Bossa commented on CASSANDRA-9318: - [~jbellis], bq. it causes other problems in the other two (non-global-overload) scenarios. I think you are overstating the problem here, because the first two scenarios are either very limited in time (the first), or very limited in magnitude (the second), and the back-pressure algorithm is configurable to be as sensitive and as reactive as you wish, by tuning the incoming/outgoing imbalance you want to tolerate, and the growth factor. bq. I honestly don't see what is "better" about a "slow every write down to the speed of the slowest, possibly sick, replica" approach. Defining a simple high water mark on requests in flight should be much simpler without the negative side effects. Such a threshold would be too arbitrary and coarse-grained, but that's not even the problem; the point is rather what you're going to do when the threshold is met. That is, say the high water mark is met; we really have these options: 1) Throttle at the rate of the slow replicas, which is what we do in this patch. 2) Take the slow replica(s) out, which is even worse in terms of availability. 3) Rate limit the message dequeueing in the outbound connection, but this only moves the back-pressure problem from one place to another. 4) Rate limit at a global rate equal to the water mark, but this only helps the coordinator, as such rate might still be too high for the slow replicas. In the end, I can't see any better options than what we implement in this patch for those use cases willing to trade performance for overall stability, and I would at least have it go through proper QA testing, to see how it behaves on larger clusters, fix any sharp edges, and see how it stands overall. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375012#comment-15375012 ] Jonathan Ellis commented on CASSANDRA-9318: --- bq. Hints are not a solution for chronically overloaded clusters where clients ingest faster than replicas can consume That is the situation I describe in scenario 3, which is the problem I opened this ticket to solve. So, I agree that scenario is a problem, but I don't think this proposal is a very good solution for that, and it causes other problems in the other two (non-global-overload) scenarios. bq. I think we do solve that, actually in a better way, which takes into consideration all replicas, not just the coordinator capacity of acting as a buffer, unless I'm missing a specific case you're referring to? I honestly don't see what is "better" about a "slow every write down to the speed of the slowest, possibly sick, replica" approach. Defining a simple high water mark on requests in flight should be much simpler without the negative side effects. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374158#comment-15374158 ] Stefania commented on CASSANDRA-9318: - bq. Right, but there isn't much we can do without way more invasive changes. Anyway, I don't think that's actually a problem, as if the coordinator is overloaded we'll end up generating too many hints and fail with OverloadedException (this time with its original meaning), so we should be covered. I tend to agree that it is an approximation we can live with; I also would rather not change the lower levels of messaging service for this. bq. Does it mean we should advance the protocol version in this issue, or delegate to a new issue? We have a number of issues waiting for protocol V5, they are labeled as {{protocolv5}}. Either we make this issue dependent on V5 as well or, since we are committing this as disabled, we delegate to a new issue that is dependent on V5. bq. Do you see any complexity I'm missing there? A new flag would involve a new version and it would need to be handled during rolling upgrades. Even if on its own it is not too complex, the system in its entirety becomes even more complex (different versions, compression, cross-node-timeouts, some verbs are droppable, others aren't and the list goes on). Unless it solves a problem, I don't think we should consider it; and we are saying in other parts of this conversation that hints are no longer a problem. bq. as the advantage would be increased consistency at the expense of more resource consumption, We don't increase consistency if the client has been told the mutation failed IMO. If we are instead referring to replicas that were out of the CL pool and temporarily overloaded, I think they are better off dropping mutations and handling them later on through hints. Basically, I see dropping mutations replica side as a self defense mechanism for replicas, I don't think we should remove it, rather we should focus on a backpressure strategy such that replicas don't need to drop mutations. Also, for the time being, I'd rather focus on the major issue, which is that we haven't reached consensus on how to apply backpressure yet, and propose this new idea in a follow up ticket if backpressure is successful. bq. These are valid concerns of course, and given similar concerns from Jonathan Ellis, I'm working on some changes to avoid write timeouts due to healthy replicas unnaturally throttled by unhealthy ones, and depending on Jonathan Ellis answer to my last comment above, maybe only actually back-pressure if the CL is not met. OK, so we are basically trying to address the 3 scenarios by throttling/failing only if the system as a whole cannot handle the mutations (that is at least CL replicas are slow/overloaded) whereas if less than CL replicas are slow/overloaded, those replicas get hinted? > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. 
> An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373750#comment-15373750 ] Sergio Bossa commented on CASSANDRA-9318: - [~jbellis], bq. I think you just explained why that's not a very good reason to add more complexity here, at least not without a demonstration that it's actually still a problem. Hints are not a solution for chronically overloaded clusters where clients ingest faster than replicas can consume: even if hints get delivered timely and reliably, replicas will always play catchup if there's no back-pressure. I find this pretty straightforward, but as a practical example, try injecting the attached byteman rule into a cluster and see it fall over at a rate of hundreds of thousands of dropped mutations per minute. bq. But CL is per request. How do you disentangle that client side? Not sure I follow your objection; can you elaborate? bq. And we're still not solving what I think is (post file-based hints) the real problem, my scenario 3. I think we do solve that, actually in a better way, which takes into consideration all replicas, not just the coordinator's capacity to act as a buffer, unless I'm missing a specific case you're referring to? > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373696#comment-15373696 ] Jonathan Ellis commented on CASSANDRA-9318: --- bq. I have interacted with many people using Cassandra that would actually like to see some rate limiting applied for cases 1 and 2 such that things don't fall over (shouldn't happen with new hints hopefully) I think you just explained why that's not a very good reason to add more complexity here, at least not without a demonstration that it's actually still a problem. bq. What if we tailored the algorithm to only: Rate limit if CL replicas are below the high threshold / Throw exception if CL replicas are below the low threshold. But CL is per request. How do you disentangle that client side? And we're still not solving what I think is (post file-based hints) the real problem, my scenario 3. At this point instead of adding more complexity to an approach that fundamentally doesn't solve that, why not back up and use an approach that does the right thing in all 3 cases instead? > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373624#comment-15373624 ] Sergio Bossa commented on CASSANDRA-9318: - [~Stefania], bq. if a message expires before it is sent, we consider this negatively for that replica, since we increment the outgoing rate but not the incoming rate when the callback expires, and still it may have nothing to do with the replica if the message was not sent, it may be due to the coordinator dealing with too many messages. Right, but there isn't much we can do without way more invasive changes. Anyway, I don't think that's actually a problem, as if the coordinator is overloaded we'll end up generating too many hints and fail with {{OverloadedException}} (this time with its original meaning), so we should be covered. bq. I also observe that if a replica has a low rate, then we may block when acquiring the limiter, and this will indirectly throttle for all following replicas, even if they were ready to receive mutations sooner. See my answer at the end. bq. AbstractWriteResponseHandler sets the start time in the constructor, so the time spent acquiring a rate limiter for slow replicas counts towards the total time before the coordinator throws a write timeout exception. See my answer at the end. bq. SP.sendToHintedEndpoints(), we should apply backpressure only if the destination is alive. I know, I'm holding on these changes until we settle on a plan for the whole write path (in terms of what to do with CL, the exception to be thrown etc.). bq. Let's use UnavailableException since WriteFailureException indicates a non-timeout failure when processing a mutation, and so it is not appropriate for this case. For protocol V4 we cannot change UnavailableException, but for V5 we should add a new parameter to it. At the moment it contains , we should add the number of overloaded replicas, so that drivers can treat the two cases differently. Does it mean we should advance the protocol version in this issue, or delegate to a new issue? bq. Marking messages as throttled would let the replica know if backpressure was enabled, that's true, but it also makes the existing mechanism even more complex. How so? In implementation terms, it should be literally as easy as: 1) Add a byte parameter to {{MessageOut}}. 2) Read such byte parameter from {{MessageIn}} and eventually skip dropping it replica-side. 3) If possible (didn't check it), when a "late" response is received on the coordinator, try to cancel the related hint. Do you see any complexity I'm missing there? bq. dropping mutations that have been in the queue for longer that the RPC write timeout is done not only to shed load on the replica, but also to avoid wasting resources to perform a mutation when the coordinator has already returned a timeout exception to the client. This is very true and that's why I said it's a bit of a wild idea. Obviously, that is true outside of back-pressure, as even now it is possible to return a write timeout to clients and still have some or all mutations applied. In the end, it might be good to optionally enable such behaviour, as the advantage would be increased consistency at the expense of more resource consumption, which is a tradeoff some users might want to make, but to be clear, I'm not strictly lobbying to implement it, just trying to reason about pros and cons. bq. I still have concerns regarding additional write timeout exceptions and whether an overloaded or slow replica can slow everything down. 
These are valid concerns of course, and given similar concerns from [~jbellis], I'm working on some changes to avoid write timeouts due to healthy replicas unnaturally throttled by unhealthy ones, and depending on [~jbellis] answer to my last comment above, maybe only actually back-pressure if the CL is not met. Stay tuned. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce
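To make the "marking" idea from the comment above concrete, here is a minimal, hypothetical sketch. The parameter key and the helper class are illustrative only and not part of the actual patch; the sketch assumes the {{MessageOut#withParameter}} and {{MessageIn#parameters}} facilities of the 3.x messaging code, and omits step 3 (cancelling the related hint), which was explicitly flagged as unverified.
{code}
import java.util.Map;

import org.apache.cassandra.db.Mutation;
import org.apache.cassandra.net.MessageIn;
import org.apache.cassandra.net.MessageOut;

// Hypothetical helper illustrating the idea; names are not from the patch.
public final class BackPressureMarker
{
    // Hypothetical parameter key used to tag throttled mutations.
    static final String BACK_PRESSURE_PARAM = "BP";

    // Coordinator side: tag the outgoing mutation so the replica knows the
    // coordinator already applied back-pressure to it.
    static MessageOut<Mutation> markThrottled(MessageOut<Mutation> message)
    {
        return message.withParameter(BACK_PRESSURE_PARAM, new byte[]{ 1 });
    }

    // Replica side: skip load shedding for messages marked as throttled, since
    // they should still arrive within the RPC write timeout.
    static boolean shouldDrop(MessageIn<Mutation> message, boolean pastWriteTimeout)
    {
        Map<String, byte[]> params = message.parameters;
        boolean throttled = params.containsKey(BACK_PRESSURE_PARAM);
        return pastWriteTimeout && !throttled;
    }
}
{code}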
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373547#comment-15373547 ] Sergio Bossa commented on CASSANDRA-9318: - [~jbellis], bq. Put another way: we do NOT want to limit performance to the slowest node in a set of replicas. That is kind of the opposite of the redundancy we want to provide. What if we tailored the algorithm to only: * Rate limit if CL replicas are below the high threshold. * Throw exception if CL replicas are below the low threshold. By doing so, C* would behave the same provided at least CL replicas behave normally. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373537#comment-15373537 ] Jeremiah Jordan commented on CASSANDRA-9318: [~jbellis] then we can default this to off. I have interacted with *many* people using Cassandra that would actually like to see some rate limiting applied for cases 1 and 2 such that things don't fall over (shouldn't happen with new hints hopefully) or even hint like crazy. Turning this on would allow that to happen. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373512#comment-15373512 ] Jonathan Ellis commented on CASSANDRA-9318: --- The more I think about it the more I think the entire approach may be a bad fit for Cassandra. Consider: # If a node has a "hiccup" of slow performance, e.g. due to a GC pause, we want to hint those writes and return success to the client. No need to rate limit. # If a node has a sustained period of slow performance, we want to hint those writes and return success to the client. No need to rate limit, unless we are overwhelmed with hints. (Not sure if hint overload is actually a problem with the new file based hints.) # Where we DO want to rate limit is when the client is throwing more updates at the coordinator than the system can handle, whether that is for a single node or globally across many nodes. So I see this approach as doing the wrong thing for 1 and 2 and only partially helping with 3. Put another way: we do NOT want to limit performance to the slowest node in a set of replicas. That is kind of the opposite of the redundancy we want to provide. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15372114#comment-15372114 ] Stefania commented on CASSANDRA-9318: - bq. But can that really happen? ResponseVerbHandler returns before incrementing back-pressure if the callback is null (i.e. expired), and OutboundTcpConnection doesn't even send outbound messages if they're timed out, or am I missing something? You're correct, we won't count twice because the callback is already null. However, this raises another point: if a message expires before it is sent, we consider this negatively for that replica, since we increment the outgoing rate but not the incoming rate when the callback expires, and still it may have nothing to do with the replica if the message was not sent; it may be due to the coordinator dealing with too many messages. bq. Again, I believe this would make enabling/disabling back-pressure via JMX less user friendly. Fine, let's keep the boolean since it makes life easier for JMX. bq. I do not think sorting replicas is what we really need, as you have to send the mutation to all replicas anyway. I think what you rather need is a way to pre-emptively fail if the write consistency level is not met by enough "non-overloaded" replicas, i.e.: You're correct in that the replicas are not sorted in the write path, only in the read path. I confused the two yesterday. For sure we need to only fail if the write consistency level is not met. I also observe that if a replica has a low rate, then we may block when acquiring the limiter, and this will indirectly throttle all following replicas, even if they were ready to receive mutations sooner. Therefore, even a single overloaded or slow replica may slow the entire write operation. Further, AbstractWriteResponseHandler sets the start time in the constructor, so the time spent acquiring a rate limiter for slow replicas counts towards the total time before the coordinator throws a write timeout exception. So, unless we increase the write RPC timeout or change the existing behavior, we may observe write timeout exceptions and, at CL.ANY, hints. Also, in SP.sendToHintedEndpoints(), we should apply backpressure only if the destination is alive. {quote} This leaves us with two options: Adding a new exception to the native protocol. Reusing a different exception, with WriteFailureException and UnavailableException the most likely candidates. I'm currently leaning towards the latter option. {quote} Let's use UnavailableException since WriteFailureException indicates a non-timeout failure when processing a mutation, and so it is not appropriate for this case. For protocol V4 we cannot change UnavailableException, but for V5 we should add a new parameter to it. At the moment it contains {{}}; we should add the number of overloaded replicas, so that drivers can treat the two cases differently. Another alternative, as suggested by [~slebresne], is to simply consider overloaded replicas as dead and hint them, therefore throwing unavailable exceptions as usual, but this is slightly less accurate than letting clients know that some replicas were unavailable and some simply overloaded. bq. We only need to ensure the coordinator for that specific mutation has back-pressure enabled, and we could do this by "marking" the MessageOut with a special parameter, what do you think? 
Marking messages as throttled would let the replica know if backpressure was enabled, that's true, but it also makes the existing mechanism even more complex. Also, as far as I understand it, dropping mutations that have been in the queue for longer than the RPC write timeout is done not only to shed load on the replica, but also to avoid wasting resources to perform a mutation when the coordinator has already returned a timeout exception to the client. I think this still holds true regardless of backpressure. Since we cannot remove a timeout check in the write response handlers, I don't see how it helps to drop it replica side. If the message was throttled, even with cross_node_timeout enabled, the replica should have time to process it before the RPC write timeout expires, so I don't think the extra complexity is justified. bq. If you all agree with that, I'll move forward and make that change. To summarize, I agree with this change, provided the drivers can separate the two cases (node unavailable vs. node overloaded), which they will be able to do with V5 of the native protocol. The alternative would be to simply consider overloaded replicas as dead and hint them. Further, I still have concerns regarding additional write timeout exceptions and whether an overloaded or slow replica can slow everything down. [~slebresne], [~jbellis], anything else from your side? I think Jonathan's proposal of bounding total outstanding requests to all replicas, is
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371063#comment-15371063 ] Sergio Bossa commented on CASSANDRA-9318: - bq. One thing that worries me is, how do you distinguish between “node X is slow because we are writing too fast and we need to throttle clients down” and “node X is slow because it is dying, we need to ignore it and accept writes based on other replicas?” I.e. this seems to implicitly push everyone to a kind of CL.ALL model once your threshold triggers, where if one replica is slow then we can't make progress. This was already noted by [~slebresne], and you're both right: this initial implementation is heavily biased towards my specific use case :) But the above proposed solution should fix it: bq. I think what you rather need is a way to pre-emptively fail if the write consistency level is not met by enough "non-overloaded" replicas, i.e.: If CL.ONE, fail if all replicas are overloaded... Also, the exception would be sent to the client only if the low threshold is met, and only the first time it is met, for the duration of the back-pressure window (write RPC timeout), i.e.: * Threshold is 0.1, outgoing requests are 100, incoming responses are 10, ratio is 0.1. * Exception is thrown by all write requests for the current back-pressure window. * The outgoing rate limiter is set at 10, which means the next ratio calculation will approach the sustainable rate, and even if replicas still lag behind, the ratio will not go down to 0.1 _unless_ the incoming rate dramatically goes down to 1. This is to say the chances of getting a meltdown due to "overloaded exceptions" are moderate (also, clients are supposed to adjust themselves when getting such an exception), and the above proposal should make things play nicely with the CL too. If you all agree with that, I'll move forward and make that change. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
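As a purely illustrative restatement of the arithmetic in the bullet points above (the variable names and window handling are assumptions, not the patch's actual code):
{code}
// Standalone illustration of the numbers discussed above; names are illustrative.
public class BackPressureRatioExample
{
    public static void main(String[] args)
    {
        double lowThreshold = 0.1; // at or below this ratio the replica counts as overloaded
        double outgoing = 100;     // requests sent to the replica in the last back-pressure window
        double incoming = 10;      // responses received from it in the same window

        double ratio = incoming / outgoing;          // 0.1
        boolean overloaded = ratio <= lowThreshold;  // true: throw the exception for this window

        // The outgoing limiter is set to the observed incoming rate, so the next
        // window approaches the sustainable rate instead of collapsing further.
        double newLimiterRate = incoming;            // 10 requests per window

        System.out.printf("ratio=%.2f overloaded=%b newRate=%.0f%n", ratio, overloaded, newLimiterRate);
    }
}
{code}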
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371010#comment-15371010 ] Jonathan Ellis commented on CASSANDRA-9318: --- If I understand this approach correctly, you're looking at the ratio of sent to acknowledged writes per replica, and throwing Unavailable if that gets too low for a given replica. Very clever. One thing that worries me is, how do you distinguish between “node X is slow because we are writing too fast and we need to throttle clients down” and “node X is slow because it is dying, we need to ignore it and accept writes based on other replicas?” I.e. this seems to implicitly push everyone to a kind of CL.ALL model once your threshold triggers, where if one replica is slow then we can't make progress. If we take a simpler approach of just bounding total outstanding requests to all replicas per coordinator, then we can avoid overload meltdown while allowing CL to continue to work as designed and tolerate as many slow replicas as the requested CL permits. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
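For contrast, here is a minimal sketch of the simpler approach advocated here (and in the ticket description): bound the total outstanding requests per coordinator with high/low watermarks and pause reads from client connections in between. The watermark values and the Netty wiring are assumptions for illustration, not code from this ticket.
{code}
import java.util.concurrent.atomic.AtomicLong;

import io.netty.channel.Channel;

// Illustrative coordinator-wide bound on in-flight requests; not from any patch here.
public class InFlightRequestLimiter
{
    private static final long HIGH_WATERMARK = 10_000; // illustrative values
    private static final long LOW_WATERMARK  = 8_000;

    private final AtomicLong inFlight = new AtomicLong();

    // Called when a client request is dispatched to replicas.
    public void onRequestStart(Channel clientChannel)
    {
        if (inFlight.incrementAndGet() >= HIGH_WATERMARK)
            clientChannel.config().setAutoRead(false); // stop reading new client requests
    }

    // Called once all replica responses (or timeouts) for the request are accounted for.
    public void onRequestEnd(Channel clientChannel)
    {
        if (inFlight.decrementAndGet() <= LOW_WATERMARK)
            clientChannel.config().setAutoRead(true);  // resume reading
    }
}
{code}
Under this scheme the coordinator never waits for any particular replica, so consistency-level handling stays exactly as it is today.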
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15370533#comment-15370533 ] Sergio Bossa commented on CASSANDRA-9318: - [~Stefania], bq. Do we track the case where we receive a failed response? Specifically, in ResponseVerbHandler.doVerb, shouldn't we call updateBackPressureState() also when the message is a failure response? Good point: I focused more on the success case, due to dropped mutations, but that sounds like a good thing to do. bq. If we receive a response after it has timed out, won't we count that request twice, incorrectly increasing the rate for that window? But can that really happen? {{ResponseVerbHandler}} returns _before_ incrementing back-pressure if the callback is null (i.e. expired), and {{OutboundTcpConnection}} doesn't even send outbound messages if they're timed out, or am I missing something? bq. I also argue that it is quite easy to comment out the strategy and to have an empty strategy in the code that means no backpressure. Again, I believe this would make enabling/disabling back-pressure via JMX less user friendly. bq. I think what we may need is a new companion snitch that sorts the replicas by backpressure ratio I do not think sorting replicas is what we really need, as you have to send the mutation to all replicas anyway. I think what you rather need is a way to pre-emptively fail if the write consistency level is not met by enough "non-overloaded" replicas, i.e.: * If CL.ONE, fail if *all* replicas are overloaded. * If CL.QUORUM, fail if a *quorum* of replicas are overloaded. * If CL.ALL, fail if *any* replica is overloaded. This can be easily accomplished in {{StorageProxy#sendToHintedEndpoints}}. bq. the exception needs to be different. native_protocol_v4.spec clearly states I missed that too :( This leaves us with two options: * Adding a new exception to the native protocol. * Reusing a different exception, with {{WriteFailureException}} and {{UnavailableException}} the most likely candidates. I'm currently leaning towards the latter option. bq. By "load shedding by the replica" do we mean dropping mutations that have timed out or something else? Yes. bq. Regardless, there is the problem of ensuring that all nodes have backpressure enabled, which may not be trivial. We only need to ensure the coordinator for that specific mutation has back-pressure enabled, and we could do this by "marking" the {{MessageOut}} with a special parameter; what do you think? > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
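A rough sketch of the CL-aware check proposed above, as a simplified stand-in for what would live in {{StorageProxy#sendToHintedEndpoints}} (the types and the placeholder exception are hypothetical, not the actual code):
{code}
import java.util.List;

// Illustrative pre-emptive overload check: fail only if the CL cannot be met by
// non-overloaded replicas (all overloaded for ONE, a quorum for QUORUM, any for ALL).
public class OverloadCheck
{
    interface ReplicaBackPressure
    {
        boolean isOverloaded();
    }

    static void assertConsistencyAchievable(List<? extends ReplicaBackPressure> replicas, int blockFor)
    {
        long healthy = replicas.stream().filter(r -> !r.isOverloaded()).count();
        if (healthy < blockFor)
            // Placeholder for OverloadedException (or UnavailableException, as discussed).
            throw new IllegalStateException("only " + healthy + " non-overloaded replicas, " +
                                            blockFor + " required by the consistency level");
    }
}
{code}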
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15370159#comment-15370159 ] Stefania commented on CASSANDRA-9318: - bq. I've pushed a few more commits to address your concerns. The new commits look good: the code is clearly cleaner with an abstract BackPressureState created by the strategy, and with the back-pressure window equal to the write timeout. Some things left to do: * Rebase on trunk * Get rid of the trailing spaces bq. more specifically, the rates are now tracked together when either a response is received or the callback is expired, I have some concerns with this: * Do we track the case where we receive a failed response? Specifically, in ResponseVerbHandler.doVerb, shouldn't we call updateBackPressureState() also when the message is a failure response? * If we receive a response after it has timed out, won't we count that request twice, incorrectly increasing the rate for that window? bq. Configuration-wise, we're now left with only the back_pressure_enabled boolean and the back_pressure_strategy, and I'd really like to keep the former, as it makes it way easier to dynamically turn back-pressure on/off. It's much better right now and I am not strongly opposed to keeping back_pressure_enabled, but I also argue that it is quite easy to comment out the strategy and to have an empty strategy in the code that means no backpressure. bq. Talking about the overloaded state and the usage of OverloadedException, I agree the latter might be misleading, and I agree some failure conditions could lead to requests being wrongly refused, but I'd also like to keep some form of "emergency" feedback towards the client: what about throwing OE only if all (or a given number depending on the CL?) replicas are overloaded? I think what we may need is a new companion snitch that sorts the replicas by backpressure ratio, or an enhanced dynamic snitch that is aware of the backpressure strategy. If we sort replicas correctly, then we can throw an exception to the client if one of the selected replicas is overloaded. However, I think the exception needs to be different. native_protocol_v4.spec clearly states:
{code}
0x1001    Overloaded: the request cannot be processed because the coordinator node is overloaded
{code}
Sorry I totally missed the meaning of this exception in the previous round of review. bq. Finally, one more wild idea to consider: given this patch greatly reduces the number of dropped mutations, and hence the number of inflight hints, what do you think about disabling load shedding on the replica side when back-pressure is enabled? This way we'd trade "full consistency" for a hopefully smaller number of unnecessary hints sent over to "pressured" replicas when their callbacks expire on the coordinator side. By "load shedding by the replica" do we mean dropping mutations that have timed out or something else? Regardless, there is the problem of ensuring that all nodes have backpressure enabled, which may not be trivial. I think replicas should have a mechanism of protecting themselves, rather than relying on the coordinator, but I can comment further if we clarify what we mean exactly. 
> Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15366583#comment-15366583 ] Sergio Bossa commented on CASSANDRA-9318: - [~Stefania], [~slebresne], I've pushed a few more commits to address your concerns. First of all, I've got rid of the back-pressure timeout: the back-pressure window for the rate-based algorithm is now equal to the write timeout, and the overall implementation has been improved to better track in/out rates and avoid the need for a larger window; more specifically, the rates are now tracked together when either a response is received or the callback is expired, which avoids edge cases causing an unbalanced in/out rate when a burst of outgoing messages is recorded on the edge of a window. Also, I've abstracted {{BackPressureState}} into an interface as requested. Configuration-wise, we're now left with only the {{back_pressure_enabled}} boolean and the {{back_pressure_strategy}}, and I'd really like to keep the former, as it makes it way easier to dynamically turn back-pressure on/off. Talking about the overloaded state and the usage of {{OverloadedException}}, I agree the latter might be misleading, and I agree some failure conditions could lead to requests being wrongly refused, but I'd also like to keep some form of "emergency" feedback towards the client: what about throwing OE only if _all_ (or a given number depending on the CL?) replicas are overloaded? Regarding when and how to ship this, I'm fine with trunk and I agree it should be off by default for now. Finally, one more wild idea to consider: given this patch greatly reduces the number of dropped mutations, and hence the number of inflight hints, what do you think about disabling load shedding on the replica side when back-pressure is enabled? This way we'd trade "full consistency" for a hopefully smaller number of unnecessary hints sent over to "pressured" replicas when their callbacks expire on the coordinator side. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
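A minimal sketch of the rate tracking just described, with illustrative names rather than the actual classes in the patch: both counters are updated at the same point (response received or callback expired), so a burst of sends at the edge of a window cannot skew the in/out ratio.
{code}
import java.util.concurrent.atomic.AtomicLong;

// Illustrative per-replica back-pressure state; names are not from the patch.
public class ReplicaRateState
{
    private final AtomicLong outgoing = new AtomicLong(); // requests accounted for in this window
    private final AtomicLong incoming = new AtomicLong(); // responses received in this window

    // Called when the replica answers within the write RPC timeout.
    public void onResponseReceived()
    {
        outgoing.incrementAndGet();
        incoming.incrementAndGet();
    }

    // Called when the callback expires: the request counts against the replica,
    // but no response is recorded.
    public void onCallbackExpired()
    {
        outgoing.incrementAndGet();
    }

    // Ratio consumed by the rate-based strategy; 1.0 means the replica keeps up.
    public double inOutRatio()
    {
        long out = outgoing.get();
        return out == 0 ? 1.0 : (double) incoming.get() / out;
    }
}
{code}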
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365817#comment-15365817 ] Sylvain Lebresne commented on CASSANDRA-9318: - A first concern I have, that has been brought up earlier in the conversation already, is that this will create a lot more OverloadedExceptions than we're currently used to, with 2 concrete issues imo: # the new OverloadedException has a different meaning than the old one. The old one meant the coordinator was overloaded (by too much hint writing), so client drivers will tend to try another coordinator when that happens. That's not a good idea here. # I think in many cases that's not what the user wants. If you write at CL.ONE, and 1 of your 3 replicas is struggling (but other nodes are perfectly fine), having all your queries rejected (breaking availability concretely) until that one replica catches up (or dies) feels quite wrong. What I think you'd want is to consider the node dead for that query and hint that slow node (and if a coordinator gets overwhelmed by those hints, we already throw an OE), but proceed with the write. I'll note in particular that when a node dies, it's not detected right away, and detection can actually take a few seconds, which might well mean a node that just died will be considered overloaded temporarily by some coordinator. So considering that overloaded is roughly the same as dead makes some sense (in particular, you wouldn't want to start failing writes due to this mechanism because a node dies, if you have enough nodes for your CL). bq. For trunk, we can enable it by default provided we can test it on a medium size cluster before committing We *should* test it on a medium sized cluster before committing, but even then I'm really reluctant to make it the default on commit. I don't think making new features that are untested in production and can have a huge impact the default right away is smart, even though we have done it numerous times (unsuccessfully most of the time, I'd say). I'd much rather leave it opt-in for a few releases so users can test it, and wait for more feedback before we make it the default (I know the counter argument, that no-one will test it unless it's a default, but I doubt that's true if people are genuinely put off by the existing behavior). In general, as was already brought up earlier in the discussion, I suspect fine tuning the parameters won't be trivial, and I think we'll need a fair amount of testing in different conditions to guarantee the defaults for those parameters are sane. bq. back_pressure_timeout_override: should we call it for what it is, the back_pressure_window_size and recommend that they increase the write timeout in the comments I concur that I don't love the name either, nor the concept of having an override for another setting. I'd also prefer splitting the double meaning, letting {{write_request_timeout_in_ms}} always be the timeout, and adding a {{window_size}} to the strategy parameters. We can additionally do 2 things to get roughly the same default as in the current patch: # make {{window_size}} be equal to the write timeout by default. # make the {{*_request_timeout_in_ms}} option a "silent" default (i.e. commented out by default), and make the default for the write one depend on whether back_pressure is on or not. 
To finish on the yaml options, if we move {{window_size}} to the strategy parameters, we could also get rid of {{back_pressure_enabled}} and instead have a {{NoBackPressure}} strategy (which we can totally special case internally). Don't care terribly about it, but it's imo slightly more in line with other strategies. I'd also suggest abstracting the {{BackPressureState}} state as an interface and making the strategy responsible for creating a new one, since most of the details of the state are likely somewhat strategy-specific. As far as I can tell, all we'd need as an interface is:
{noformat}
interface BackPressureState
{
    void onMessageSent();
    void onResponseReceived();
    double currentBackPressureRate();
}
{noformat}
This won't provide more info for custom implementations just yet, but at least it'll give them more flexibility, and it's imo a bit clearer. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364517#comment-15364517 ] Aleksey Yeschenko commented on CASSANDRA-9318: -- I'm not sure that putting anything like this in 3.0.x is reasonable, sorry. It's not technically a bug fix. Whether or not it should be enabled by default in 3.x is up for debate (would have to be an even feature release, too). > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363543#comment-15363543 ] Stefania commented on CASSANDRA-9318: - Thanks for your input [~jjordan]. I'm +1 with committing this to 3.0/3.9 disabled by default as the impact should be minimal. For trunk, we can enable it by default provided we can test it on a medium size cluster before committing. If that's not the case, we can commit it disabled by default, and open a new follow-up ticket, to test it and enable it by default once it has been tested. We should probably also prepare a short paragraph for NEWS.txt [~sbtourist]. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363288#comment-15363288 ] Jeremiah Jordan commented on CASSANDRA-9318: bq. One thing we did not confirm is whether you are happy committing this only to trunk or whether you need this in 3.0. Strictly speaking 3.0 accepts only bug fixes, not new features. However, this is an optional feature that solves a problem (dropped mutations) and that is disabled by default, so we have a case for an exception. I think it would be nice for users to get this in 3.0/3.9 if you think it will have minimal effect when disabled. I also think it would be good to enable it by default in trunk. We have lots of people who get put off by the fact that you can overload your cluster enough to start dropping mutations, and the system doesn't "slow things down" such that it can cope. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15360784#comment-15360784 ] Stefania commented on CASSANDRA-9318: - bq. For that specific test I've got no client timeouts at all, as I wrote at ONE. Sorry I should have been clearer, I meant what were the {{write_request_timeout_in_ms}} and {{back_pressure_timeout_override}} yaml settings? bq. Agreed with all your points. I'll see what I can do, but any help/pointers will be very appreciated. We can do the following: bq. verify we can reduce the number of dropped mutations in a larger (5-10 nodes) cluster with multiple clients writing simultaneously I will ask for help to the TEs, more details to follow. bq. some cstar perf tests to ensure ops per second are not degraded, both read and writes We can launch a comparison test [here|http://cstar.datastax.com], 30M rows should be enough. I can launch it for you if you don't have an account. bq. the dtests should be run with and without backpressure enabled This can be done by temporarily changing cassandra.yaml on your branch and then launching the dtests. bq. we should do a bulk load test, for example for cqlsh COPY FROM I can take care of this. I don't expect problems because COPY FROM should contact the replicas directly, it's just a box I want to tick. Importing 5 to 10M rows with 3 nodes should be sufficient. bq. Please send me a PR and I'll incorporate those in my branch I couldn't create a PR, for some reason sbtourist/cassandra wasn't in the base fork list. I've attached a patch to this ticket, [^9318-3.0-nits-trailing-spaces.patch]. bq. I find the current layout effective and simple enough, but I'll not object if you want to push those under a common "container" option. The encryption options are what I was aiming at, but it's true that for everything else we have a flat layout, so let's leave it as it is. bq. I don't like much that name either, as it doesn't convey very well the (double) meaning; making the back-pressure window the same as the write timeout is not strictly necessary, but it makes the algorithm behave better in terms of reducing dropped mutations as it gives replica more time to process its backlog after the rate is reduced. Let me think about that a bit more, but I'd like to avoid requiring the user to increase the write timeout manually, as again, it reduces the effectiveness of the algorithm. I'll let you think about it. Maybe a boolean property that is true by default and that clearly indicates that the timeout is overridden, although this complicates things somewhat. bq. Sure I can switch to that on trunk, if you think it's worth performance-wise (I can write a JMH test if there isn't one already). The precision is only 10 milliseconds, if this is acceptable it would be interesting to see what the difference in performance is. bq. It is not used in any unit tests code, but it is used in my manual byteman tests, and unfortunately I need it on the C* classpath; is that a problem to keep it? Sorry I missed the byteman imports and helper. Let's just move it to the test source folder and add a comment. -- The rest of the CR points are fine. One thing we did not confirm is whether you are happy committing this only to trunk or whether you need this in 3.0. Strictly speaking 3.0 accepts only bug fixes, not new features. However, this is an optional feature that solves a problem (dropped mutations) and that is disabled by default, so we have a case for an exception. 
> Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15356882#comment-15356882 ] Sergio Bossa commented on CASSANDRA-9318: - Thanks for reviewing [~Stefania]! See my comments below... bq. If I understood correctly, what we are trying to achieve with this new approach, is reducing the number of dropped mutations rather than preventing OOM but, in a way that doesn't cause OOM by invoking a rate-limiter-acquire in a Netty shared worker pool thread, which would slow down our ability to handle new client requests, as well as throw OEs once in a while when a low water-mark is reached. Is this a fair description? Correct. I'd just highlight the fact that reducing dropped mutations and replica instability is to me as important as avoiding OOMEs on the coordinator; [~jshook] has some very good points about that on his comment above. bq. In the CCM test performed with two nodes, what was the difference in write timeout? For that specific test I've got no client timeouts at all, as I wrote at ONE. bq. In terms of testing, here is what I would suggest Agreed with all your points. I'll see what I can do, but any help/pointers will be very appreciated. bq. As for the code review, I've addressed some nits, mostly comments and trailing spaces, directly here. Looks good! Please send me a PR and I'll incorporate those in my branch :) bq. Should we perhaps group all settings under something like back_pressure_options? I find the current layout effective and simple enough, but I'll not object if you want to push those under a common "container" option. bq. back_pressure_timeout_override: should we call it for what it is, the back_pressure_window_size and recommend that they increase the write timeout in the comments, or is an override of the write timeout necessary for correctness? I don't like much that name either, as it doesn't convey very well the (double) meaning; making the back-pressure window the same as the write timeout is not _strictly_ necessary, but it makes the algorithm behave better in terms of reducing dropped mutations as it gives replica more time to process its backlog after the rate is reduced. Let me think about that a bit more, but I'd like to avoid requiring the user to increase the write timeout manually, as again, it reduces the effectiveness of the algorithm. bq. Should we use a dedicated package? I'd keep it plain as it is now. bq. Depending on the precision, in trunk there is an approximate time service with a precision of 10 ms, which is faster than calling System.currentTimeMillis() Sure I can switch to that on trunk, if you think it's worth performance-wise (I can write a JMH test if there isn't one already). bq. TESTRATELIMITER.JAVA - Is this still used? It is not used in any unit tests code, but it is used in my manual byteman tests, and unfortunately I need it on the C* classpath; is that a problem to keep it? (I can add proper javadocs to make that clear) bq. TESTTIMESOURCE.JAVA - I think we need to move it somewhere with the test sources. Leftover there from a previous test implementation; I'll move it. 
> Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: backpressure.png, limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15356800#comment-15356800 ] Stefania commented on CASSANDRA-9318: - Well, I read all of the discussion above and it took perhaps longer than actually reviewing the code! :) If I understood correctly, what we are trying to achieve with this new approach is reducing the number of dropped mutations rather than preventing OOM, but in a way that doesn't cause OOM by invoking a rate-limiter-acquire in a Netty shared worker pool thread, which would slow down our ability to handle new client requests, as well as throw OEs once in a while when a low water-mark is reached. Is this a fair description? Assuming we can prove this in a larger scenario, I am +1 on the approach but I would think it should go to trunk rather than 3.0? In the CCM test performed with two nodes, what was the difference in write timeout? In terms of testing, here is what I would suggest (cc [~cassandra-te] for any help they may be able to provide): * verify we can reduce the number of dropped mutations in a larger (5-10 nodes) cluster with multiple clients writing simultaneously * some cstar perf tests to ensure ops per second are not degraded, both read and writes * the dtests should be run with and without backpressure enabled * we should do a bulk load test, for example for cqlsh COPY FROM As for the code review, I've addressed some nits, mostly comments and trailing spaces, directly [here|https://github.com/stef1927/cassandra/commit/334509c10a5f4f759550fb64902aabc8c4a4f40c]. The code is really of excellent quality and very well unit tested, so I didn't really find much, and they are mostly suggestions. I will however have another pass next week, now that I am more familiar with the feature. h5. conf/cassandra.yaml I've edited the comments slightly, mostly to spell out for users why they may need a back pressure strategy, and to put the emphasis on the default implementation first, then say how to create new strategies. As you mentioned, new strategies may require new state parameters anyway, so it's not clear how easy it will be to add new strategies. You may be able to improve on my comments further. Should we perhaps group all settings under something like back_pressure_options? back_pressure_timeout_override: should we call it for what it is, the back_pressure_window_size, and recommend that they increase the write timeout in the comments, or is an override of the write timeout necessary for correctness? h5. src/java/org/apache/cassandra/config/Config.java I moved the creation of the default parameterized class to RateBasedBackPressure, to remove some pollution of constants. h5. src/java/org/apache/cassandra/net/BackPressureState.java Should we use a dedicated package? It would require a couple of setters and getters in the state to keep the properties non-public, which I would prefer anyway. h5. src/java/org/apache/cassandra/net/BackPressureStrategy.java As above, should we use a dedicated package? h5. src/java/org/apache/cassandra/net/RateBasedBackPressure.java OK - I've changed the code slightly to calculate backPressureWindowHalfSize in seconds only once per call. h5. src/java/org/apache/cassandra/utils/SystemTimeSource.java Depending on the precision, in trunk there is an approximate time service with a precision of 10 ms, which is faster than calling System.currentTimeMillis() h5. src/java/org/apache/cassandra/utils/TestRateLimiter.java Is this still used? h5. 
h5. src/java/org/apache/cassandra/utils/TestTimeSource.java
I think we need to move it somewhere with the test sources. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: backpressure.png, limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
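[Editorial note] The time-source abstraction touched on in the review above (SystemTimeSource, TestTimeSource) can be illustrated with a minimal sketch. The interface and class bodies below are assumptions for illustration, not the actual patch code; they only show why a swappable time source makes window- and rate-based logic unit-testable.
{code:java}
// Minimal sketch, assuming a simple millisecond-based interface; the real
// classes in the patch may differ.
import java.util.concurrent.atomic.AtomicLong;

interface TimeSource
{
    long currentTimeMillis();
}

class SystemTimeSource implements TimeSource
{
    public long currentTimeMillis()
    {
        return System.currentTimeMillis();
    }
}

// Test-only: time advances only when the test says so, which is why it belongs
// with the test sources rather than under src/java.
class TestTimeSource implements TimeSource
{
    private final AtomicLong now = new AtomicLong();

    public long currentTimeMillis()
    {
        return now.get();
    }

    void advance(long millis)
    {
        now.addAndGet(millis);
    }
}
{code}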
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15345077#comment-15345077 ] Joshua McKenzie commented on CASSANDRA-9318: [~aweisberg]: You've spent a fair amount of time thinking about this problem - would you be up for reviewing this? > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: backpressure.png, limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15344958#comment-15344958 ] Sergio Bossa commented on CASSANDRA-9318: - I would like to reopen this ticket and propose the following patch to implement coordinator-based back-pressure:
| [3.0 patch|https://github.com/apache/cassandra/compare/cassandra-3.0...sbtourist:CASSANDRA-9318-3.0?expand=1] | [testall|https://cassci.datastax.com/view/Dev/view/sbtourist/job/sbtourist-CASSANDRA-9318-3.0-testall/] | [dtest|https://cassci.datastax.com/view/Dev/view/sbtourist/job/sbtourist-CASSANDRA-9318-3.0-dtest/] |
The above patch provides full-blown, end-to-end, replica-to-coordinator-to-client *write* back-pressure, based on the following main concepts:
* The replica itself has no back-pressure knowledge: it keeps trying to write mutations as fast as possible, and still applies load shedding.
* The coordinator tracks the back-pressure state *per replica*, which in the current implementation consists of the incoming and outgoing rate of messages from/to the replica.
* The coordinator is configured with a back-pressure strategy that, based on the back-pressure state, applies a given back-pressure algorithm when sending mutations to each replica.
* The provided default strategy is based on the incoming/outgoing message ratio, used to rate limit outgoing messages towards a given replica.
* The back-pressure strategy is also in charge of signalling the coordinator when a given replica is considered "overloaded", in which case an {{OverloadedException}} is thrown to the client for all mutation requests deemed "overloading", until the strategy considers the overloaded state to be over.
* The provided default strategy uses configurable low/high thresholds to either rate limit or throw an exception back to clients.
While all of that might seem too complex, the patch is actually surprisingly simple. I provided as many unit tests as possible, and I've also tested it on a 2-node CCM cluster, using [ByteMan|http://byteman.jboss.org/] to simulate a slow replica, and I'd say results are quite promising: as an example, see the attached ByteMan script and plots showing a cluster with no back-pressure ending up dropping ~200k mutations, while a cluster with back-pressure enabled drops only ~2k, which means less coordinator overload and an easily recoverable replica state via hints. I can foresee at least two open points:
* We might want to track more "back-pressure state" to allow implementing different strategies; I personally believe strategies based on in/out rates are the most appropriate ones to avoid *both* the overloading and dropped mutations problems, but people might think differently.
* When the {{OverloadedException}} is (eventually) thrown, some requests might have been already sent, which is exactly what currently happens with hint overloading too: we might want to check both kinds of overloading before actually sending any mutations to replicas.
Thoughts? > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. 
> An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
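[Editorial note] The default strategy described in the comment above (per-replica incoming/outgoing rates, a ratio used to rate limit, and low/high thresholds deciding between throttling and signalling overload) can be sketched roughly as follows. This is a hedged illustration: the class and field names, the Guava RateLimiter plumbing, and the exact threshold semantics are assumptions, not the patch's actual RateBasedBackPressure implementation.
{code:java}
import java.util.concurrent.atomic.AtomicLong;

import com.google.common.util.concurrent.RateLimiter;

// Illustrative per-replica state: counts of outgoing mutations and incoming
// acks, plus a rate limiter applied to traffic towards that replica.
class ReplicaBackPressureState
{
    final AtomicLong outgoing = new AtomicLong();
    final AtomicLong incoming = new AtomicLong();
    final RateLimiter limiter = RateLimiter.create(Double.MAX_VALUE); // effectively unlimited
}

class RateBasedStrategySketch
{
    private final double lowRatio;  // below this the replica is treated as overloaded
    private final double highRatio; // above this no throttling is applied

    RateBasedStrategySketch(double lowRatio, double highRatio)
    {
        this.lowRatio = lowRatio;
        this.highRatio = highRatio;
    }

    /**
     * Called once per back-pressure window with the observed outgoing rate.
     * Returns false when the replica should be flagged as overloaded, in which
     * case the coordinator would fail overloading mutations back to the client.
     */
    boolean adjust(ReplicaBackPressureState state, double outgoingRatePerSecond)
    {
        double out = state.outgoing.get();
        double in = state.incoming.get();
        double ratio = out == 0 ? 1.0 : in / out;

        if (ratio >= highRatio)
        {
            state.limiter.setRate(Double.MAX_VALUE); // replica keeping up: no limit
            return true;
        }

        // Replica lagging: only send at roughly the rate it is acknowledging.
        state.limiter.setRate(Math.max(1.0, outgoingRatePerSecond * ratio));
        return ratio >= lowRatio;
    }

    /** Callers acquire a permit before sending each mutation to the replica. */
    void onSend(ReplicaBackPressureState state)
    {
        state.limiter.acquire();
        state.outgoing.incrementAndGet();
    }

    void onResponse(ReplicaBackPressureState state)
    {
        state.incoming.incrementAndGet();
    }
}
{code}
The key design point, as described above, is that all state is kept per replica, so one slow replica only throttles the traffic directed at it while the rest of the cluster proceeds at full speed.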
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15118187#comment-15118187 ] Russell Alexander Spitzer commented on CASSANDRA-9318: -- [~aweisberg] I think I can get you a repro pretty easily with the SparkCassandraConnector. We tend to get a few comments a week where the driver ends up excepting out on "NoHostAvailable" exceptions. Usually it's folks who have done RF=1 and then a full table scan, or who are doing heavy writes to a cluster with a very high number of concurrent writers. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg > Fix For: 2.1.x, 2.2.x > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15091963#comment-15091963 ] Jonathan Ellis commented on CASSANDRA-9318: --- Since most of that discussion is implementation details, I'll quote the relevant part:
bq. With consistency level less than ALL, mutation processing can move to background (meaning the client was answered, but there is still work to do on behalf of the request). If the background request completion rate is lower than the incoming request rate, background requests will accumulate and eventually will exhaust all memory resources. This patch's aim is to prevent this situation by monitoring how much memory all current background requests take and, when some threshold is passed, stop moving requests to background (by not replying to a client until either memory consumption moves below the threshold or the request is fully completed).
bq. There are two main points where each background mutation consumes memory: holding the frozen mutation until the operation is complete (in order to hint it if it does not complete), and on the rpc queue to each replica, where it sits until it's sent out on the wire. The patch accounts for both of those separately and limits the former to be 10% of total memory and the latter to be 6M. Why 6M? The best answer I can give is why not :) But on a more serious note, the number should be small enough so that all the data can be sent out in a reasonable amount of time and one shard is not capable of achieving even close to full bandwidth, so empirical evidence shows 6M to be a good number. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg > Fix For: 2.1.x, 2.2.x > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
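[Editorial note] The accounting the quote above describes (bound the memory held by "background" writes to a fraction of the heap, and stop answering clients early once the threshold is passed) amounts to something like the sketch below. The names, the use of an AtomicLong counter, and everything except the 10% figure taken from the quote are illustrative assumptions, not code from that patch.
{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Sketch of threshold-based "background write" accounting.
class BackgroundWriteAccounting
{
    private final AtomicLong outstandingBytes = new AtomicLong();
    private final long limitBytes = Runtime.getRuntime().maxMemory() / 10; // ~10% of heap

    /** @return true if this mutation may complete in the background (client answered early). */
    boolean tryMoveToBackground(long serializedSize)
    {
        long newTotal = outstandingBytes.addAndGet(serializedSize);
        if (newTotal > limitBytes)
        {
            // Over the threshold: undo and keep the client waiting until the
            // write fully completes or memory consumption drops again.
            outstandingBytes.addAndGet(-serializedSize);
            return false;
        }
        return true;
    }

    void onBackgroundWriteComplete(long serializedSize)
    {
        outstandingBytes.addAndGet(-serializedSize);
    }
}
{code}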
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092215#comment-15092215 ] Ariel Weisberg commented on CASSANDRA-9318: --- I am not sure how they have structured things and why they need to do that. With a hard limit on ingest rate and a 2 second timeout there is a soft bound on how much memory can be committed. It's not a hard bound because there is no guarantee that the timeout thread will keep up. I do think that switching requests from async to sync based on load is a better approach than enabling/disabling read on client connections. There can be a lot of client connections so that is something that can be time consuming. I am not sold that we should do anything at all? We could add more code to address this, but I hate to do that when I can't demonstrate that I am solving a problem. I split off what I consider the good bits into CASSANDRA-10971 and CASSANDRA-10972. The delay has been other work, me considering this to be lowish priority since I can't demonstrate it, and difficulty getting hardware going that is capable of demonstrating the problem. I'm slowly iterating on it in the background right now. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg > Fix For: 2.1.x, 2.2.x > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075476#comment-15075476 ] Aleksey Yeschenko commented on CASSANDRA-9318: -- Posting myself instead: a link to a very relevant discussion by Cloudius guys: https://groups.google.com/forum/#!topic/scylladb-dev/7Ge6vHN52os > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg > Fix For: 2.1.x, 2.2.x > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075444#comment-15075444 ] Ariel Weisberg commented on CASSANDRA-9318: --- HT Aleksey Scylla discussion on a similar topic https://groups.google.com/forum/#!topic/scylladb-dev/7Ge6vHN52os > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg > Fix For: 2.1.x, 2.2.x > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15062162#comment-15062162 ] Ariel Weisberg commented on CASSANDRA-9318: --- To summarize what I have been experimenting with. I have been limiting disk throughput with a rate limiter both with and without limiting the commit log. I am excluding the commitlog because I don't want to bottleneck the ingest rate. I have also been experimenting with limiting the rate at which remote delivered messages are consumed, which more or less forces the mutations to time out. I added hint backpressure and since then I have been unable to make the server fail or OOM from slow disk or slow remote nodes. What I have found is that I am unable to get the server to start enough concurrent requests to OOM in the disk bound case because the request stage is bounded to 128 requests. In the remote mutation bound case I think I am running out of ingest capacity on a single node and that is preventing me from running up more requests in memory faster than they can be evicted by the 2 second timeout. With gigabit ethernet the reason is obvious. With 10-gigabit ethernet I would still be limited to 1.6 gigabytes of data assuming timeouts are prompt. That's a hefty quantity, but something that could fit in a larger heap. I should be able to test that soon. I noticed OutboundTcpConnection is another potential source of OOM since it can only drop messages if the thread isn't blocked writing to the socket. I am allowing it to wake up regularly in my testing, but if something is unavailable we would have to not OOM before failure detection kicks in, which is a longer timespan. It might make sense to have another thread periodically check if it should expire messages from the queue. This isn't something I can test for without access to 10-gig: you would have to coordinate several hundred (or thousand?) megabytes of transaction data to OOM. This assumes you can get past the ingest limitations of gigabit ethernet. For code changes I think that disabling reads based on memory pressure is mostly harmless, but I can't demonstrate that it adds a lot of value given how things behave with a large enough heap and the network level bounds on ingest rate. I do want to add backpressure to hints and the compressed commit log because that is an easy path to OOM. I'll package those changes up today. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg > Fix For: 2.1.x, 2.2.x > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
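[Editorial note] The idea floated above, a separate thread that periodically expires messages from the outbound queue so drops don't depend on the writer thread being unblocked, could look roughly like this. Class and field names are made up for illustration; this is not OutboundTcpConnection's actual structure.
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of a periodic expiry task for an outbound message queue, so that
// timed-out messages are dropped even while the writer thread is blocked on a
// slow or dead socket.
class OutboundQueueExpirySketch
{
    static final class QueuedMessage
    {
        final long enqueuedAtMillis;

        QueuedMessage(long enqueuedAtMillis)
        {
            this.enqueuedAtMillis = enqueuedAtMillis;
        }
    }

    private final BlockingQueue<QueuedMessage> backlog = new LinkedBlockingQueue<>();
    private final long timeoutMillis;

    OutboundQueueExpirySketch(long timeoutMillis, ScheduledExecutorService scheduler)
    {
        this.timeoutMillis = timeoutMillis;
        // Runs independently of the thread writing to the socket.
        scheduler.scheduleWithFixedDelay(this::expireStale, 1, 1, TimeUnit.SECONDS);
    }

    void enqueue(QueuedMessage message)
    {
        backlog.add(message);
    }

    private void expireStale()
    {
        long now = System.currentTimeMillis();
        // A real implementation would count these as dropped rather than discard silently.
        backlog.removeIf(m -> now - m.enqueuedAtMillis > timeoutMillis);
    }
}
{code}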
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056001#comment-15056001 ] Aleksey Yeschenko commented on CASSANDRA-9318: -- CASSANDRA-9834 relevant here (and related discussion in CASSANDRA-6230). > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg > Fix For: 2.1.x, 2.2.x > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056411#comment-15056411 ] Ariel Weisberg commented on CASSANDRA-9318: --- Quick note. 65k mutations pending in the mutation stage. 7 memtables pending flush. [I hooked memtables pending flush into the backpressure mechanism.|https://github.com/apache/cassandra/commit/494eabf48ab48f1e86c058c0b583166ab39dcc39] That absolutely wrecked performance as throughput dropped to zero periodically, but throughput is still infinitely higher than when the database has OOMed. Kicked off a few performance runs to demonstrate what happens when you do have backpressure and you try various large limits on in flight memtables/requests. [9318 w/backpressure 64m 8g heap memtables count|http://cstar.datastax.com/tests/id/fa769eec-a283-11e5-bbc9-0256e416528f] [9318 w/backpressure 1g 8g heap memtables count|http://cstar.datastax.com/tests/id/4c52dd6e-a286-11e5-bbc9-0256e416528f] [9318 w/backpressure 2g 8g heap memtables count|http://cstar.datastax.com/tests/id/b3d5b470-a286-11e5-bbc9-0256e416528f] I am setting the point where backpressure turns off to almost the same limit as the one where it turns on. This smooths out performance just enough for stress to not constantly emit huge numbers of errors as writes time out because the database stops serving requests for a long time waiting for a memtable to flush. With pressure from memtables somewhat accounted for, the remaining source of pressure that can bring down a node is remotely delivered mutations. I can throw those into the calculation and add a listener that blocks reads from other cluster nodes. It's a nasty thing to do, but maybe not that different from OOM. I am going to hack together something to force a node to be slow so I can demonstrate overwhelming it with remotely delivered mutations first. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg > Fix For: 2.1.x, 2.2.x > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056940#comment-15056940 ] Ariel Weisberg commented on CASSANDRA-9318: --- Further review shows that with an 8 gigabyte heap backpressure isn't getting a chance to kick in. Several memtables end up being queued at once for flushing but the peak memory utilization is only a few hundred megabytes. I am back to not having a workload that demonstrates that backpressure has value. I tried artificially constraining disk throughput to 32 megabytes/sec and ended up hitting OOM in hints. Hints appear to allocate ByteBuffers indefinitely so they can't provide backpressure based on available disk capacity/throughput anymore. Hints are submitted directly from the contributing thread so there is no such thing as an in-flight hint anymore. I'll clean up the existing hint overload protection so that it kicks in based on the number of pending buffers and see where that gets me. I am guessing that in practice hint overload is going to be a lot harder to hit now that it is based on flat files. Clearly sequential IO performance outstrips the ability to process even large incoming mutations. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg > Fix For: 2.1.x, 2.2.x > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15054761#comment-15054761 ] Ariel Weisberg commented on CASSANDRA-9318: --- I got two cstar jobs to complete. [This job is set to allow 16 megabytes of transactions per coordinator, and disabled reads until they come back down to 12 megabytes.|http://cstar.datastax.com/graph?command=one_job=d1e720c8-a125-11e5-9051-0256e416528f=op_rate=1_write=1_aggregates=true=0=6664.35=0=11883.3] [This job is set to allow 64 megabytes of transactions per coordinator, and disabled reads until they came back down to 60 megabytes.|http://cstar.datastax.com/graph?command=one_job=26853362-a127-11e5-80c2-0256e416528f=op_rate=1_write=1_aggregates=true=0=322.85=0=12972.3] The job with 64 megabytes in flight kind of looks like it failed after 300 seconds. I didn't expect the threshold for things to fall apart to be quite that low, but generally speaking yeah more data in flight tends to cause bad things to happen. So why did the second one fall apart? First off, mad props to whoever started collecting the GC logs. Lots of continual full GC at the end. Sure enough the heap is only 1 gigabyte. Are we seriously running all our performance tests with a default heap of 1 gigabyte? I don't think it failed due to in flight requests (only had 32 megabytes in flight). I think it OOMed due to other heap pressure. For this in-flight request backpressure to work I think we need to include the weight of memtables when making the decision. I am going to bump up the heap and try again to see if I can reduce the impact of other heap pressure to the point that we can start buffering more requests in flight. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg > Fix For: 2.1.x, 2.2.x > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053719#comment-15053719 ] Ariel Weisberg commented on CASSANDRA-9318: --- I tried out limiting based on memory utilization. What I found was that for the small amount of throughput I can get out of my desktop the 2 second timeout is sufficient to evict and hint without OOM. If I extend the timeout enough I can get an OOM because eviction doesn't keep up so that demonstrates that eviction has to take place to avoid OOM. I see this change as useful in that it places a hard bound on working set size, but not sufficient. Performance tanks so badly as the heap fills up with requests being timed out that evicting them is not a problem. If that is the case maybe we should just be evicting them more aggressively so they don't have an impact on performance. Possibly based on perceived odds of receiving a response in a reasonable amount of time. It makes sense to me to use hinting as a method of getting the data off the heap and batching the replication to slow nodes or DCs. If we start evicting requests for a node maybe we should have an adaptive approach and go straight to hinting for the slow node for some % of requests. If the non-immediately hinted requests start succeeding we can gradually increase the % that go straight to the node. I am trying to think of alternatives that don't end up kicking back an error to the application. That's still an important capability to have because growing hints forever is not great, but we can start by ensuring that rest of the cluster can always operate at full speed even if a node is slow. Separately we can tackle bounding the resource utilization issues that presents. Operationally how do people feel about having many gigabytes worth of hints to deliver? Is that useful in that it allows things to continue until the slow node is addressed? > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg > Fix For: 2.1.x, 2.2.x > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
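[Editorial note] The adaptive idea in the comment above, routing some fraction of a slow node's writes straight to hints and adjusting that fraction from the fate of the writes that are still sent, could be prototyped along these lines. Everything here, including the step sizes, is a hypothetical illustration rather than anything in Cassandra.
{code:java}
import java.util.concurrent.ThreadLocalRandom;

// Sketch of adaptive "hint instead of send" routing for a node flagged as slow.
class AdaptiveHintingSketch
{
    private volatile double hintFraction = 0.0; // 0 = send everything, 1 = hint everything

    /** Decide whether to hint this write immediately instead of sending it to the slow node. */
    boolean hintImmediately()
    {
        return ThreadLocalRandom.current().nextDouble() < hintFraction;
    }

    /** Feedback from the writes that were actually sent to the node. */
    void onWriteResult(boolean timedOut)
    {
        if (timedOut)
            hintFraction = Math.min(1.0, hintFraction + 0.10); // node still slow: hint more
        else
            hintFraction = Math.max(0.0, hintFraction - 0.01); // node recovering: send more
    }
}
{code}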
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051559#comment-15051559 ] Ariel Weisberg commented on CASSANDRA-9318: --- The discussion so far has led to the conclusion that throwing OE or UE to the client is too complex a starting point, since the database will throw off more errors than it used to. The advantage of a bound on just memory committed at the coordinator is that it's "free" from the perspective of the user. OOM and blocking are similar levels of unavailability. The difference is that OOM is more disruptive. If the coordinator at least prevents itself from OOMing it can recover without intervention and continue to serve writes. I think we are being optimistic about how many writes you can buffer and time out at coordinators before things start going south. The space amplification is going to be bad and the GC overhead increases linearly with the number of timing out writes in flight. It's going to lead to survivor copying and promotion which is going to lead to fragmentation. I think we will find that we can't buffer enough writes to tolerate a 2 second timeout in a small cluster where a single coordinator must coordinate many requests for a slow node. I think we are going to see what Benedict described, especially in small clusters, where performance is cyclical as the coordinators block waiting for timeouts to fire. I have an implementation of most of this. I am fussing over how to calculate the weight of inflight callbacks and mutations without introducing a lot of overhead walking the graph repeatedly. We already walk it once to calculate the serialized size, but that is not the same thing. I usually just go with a rough guesstimate; serialized size is a good proxy for a weight, but we are trying to fill the heap reliably. It's bad enough not knowing the layout of the heap or the garbage collector induced overhead. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg > Fix For: 2.1.x, 2.2.x > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971574#comment-14971574 ] Sergio Bossa commented on CASSANDRA-9318: - bq. The whole point of this ticket is to avoid the complexity of intra-node backpressure, and instead basing coordinator -> client backpressure on the coordinator's local knowledge. Just to be clear, my proposal _does_ keep the back-pressure decision local to the coordinator, that is, there's no communication between nodes in such regard (which I agree would be a much different matter). The difference in my proposal is that we deal with back-pressure on a per-replica basis and in a fine grained way, rather than with coarse grained global memory limits which would end up flooding replicas before being triggered. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement >Reporter: Ariel Weisberg >Assignee: Jacek Lewandowski > Fix For: 2.1.x, 2.2.x > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971157#comment-14971157 ] Jonathan Ellis commented on CASSANDRA-9318: --- bq. I'd like to resurrect this one and if time permits take it by following Jonathan's proposal above, except I'd also like to propose an additional form of back-pressure at the coordinator->replica level. The whole point of this ticket is to avoid the complexity of intra-node backpressure, and instead basing coordinator -> client backpressure on the coordinator's local knowledge. (Which is not a perfect view of cluster state but it is a reasonable approximation, especially since it can use its MessagingService state to guess replica problems, even without explicit replica backpressure.) We can always add explicit intra-node backpressure later, they are complementary. But I'm not sure it will be necessary and I want to avoid over-engineering the problem to start. > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement >Reporter: Ariel Weisberg >Assignee: Jacek Lewandowski > Fix For: 2.1.x, 2.2.x > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960450#comment-14960450 ] Sergio Bossa commented on CASSANDRA-9318: - I'd like to resurrect this one and if time permits take it by following Jonathan's proposal above, except I'd also like to propose an additional form of back-pressure at the coordinator->replica level. Such back-pressure would be applied by the coordinator when sending messages to *each* replica if some kind of flow control condition is met (e.g. number of in-flight requests, drop rate; we can talk about this more at a later stage or even experiment); that is, each replica would have its own flow control, allowing us to better fine-tune the applied back-pressure. The memory-based back-pressure would at this point work as a kind of circuit breaker: if replicas can't keep up, and the applied flow control causes too many requests to accumulate on the coordinator, the memory-based limit will kick in and start pushing back to the client by either pausing or throwing OverloadedException. There are obviously details we need to discuss and/or experiment with, i.e.:
1) The flow control algorithm (we could steal from the TCP literature, using something like CoDel or Adaptive RED).
2) Whether to impose any limit on coordinator-level throttling, i.e. shedding requests that have been throttled for too long (I would say no, because the memory limit should protect against OOMs and allow the in-flight requests to be processed).
3) What to do when the memory limit is reached (we could make this policy-based).
I hope it makes sense, and I hope you see the reason behind that: dropped mutations are a problem for many C* users, and even more for C* applications that cannot rely on QUORUM reads (i.e. inverted index queries, graph queries); the proposal above is not meant to be the definitive solution, but should greatly help reduce the number of dropped mutations on replicas, which memory-based back-pressure alone doesn't (as by the time it kicks in, without flow control, replicas will already be flooded with requests). > Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement >Reporter: Ariel Weisberg >Assignee: Jacek Lewandowski > Fix For: 2.1.x, 2.2.x > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605351#comment-14605351 ] Benedict commented on CASSANDRA-9318: - Of course, things get even hairier with multi-DC, but I'm not as familiar with the logic there. Naively, it looks like a single node could quickly bring down every DC. Bound the number of in-flight requests at the coordinator - Key: CASSANDRA-9318 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 Project: Cassandra Issue Type: Improvement Reporter: Ariel Weisberg Assignee: Ariel Weisberg Fix For: 2.1.x, 2.2.x It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes. An implementation might do something like track the number of outstanding bytes and requests and if it reaches a high watermark disable read on client connections until it goes back below some low watermark. Need to make sure that disabling read on the client connection won't introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605600#comment-14605600 ] Benedict commented on CASSANDRA-9318: - That said, in general (perhaps in a separate ticket) we should probably make our heap calculations a bit more robust wrt each other. i.e. we should subtract memtable space from any heap apportionment, in case users set memtable space really high. Bound the number of in-flight requests at the coordinator - Key: CASSANDRA-9318 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 Project: Cassandra Issue Type: Improvement Reporter: Ariel Weisberg Assignee: Ariel Weisberg Fix For: 2.1.x, 2.2.x It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes. An implementation might do something like track the number of outstanding bytes and requests and if it reaches a high watermark disable read on client connections until it goes back below some low watermark. Need to make sure that disabling read on the client connection won't introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605597#comment-14605597 ] Benedict commented on CASSANDRA-9318: -
bq. default timeout is 2s not 10, so actually fine in your example of 300MB vs 150MB/s x 2s
Looks like in 2.0 this was 10s, and it was hard-coded in yaml, so anyone upgrading from 2.0 or before likely has a 10s timeout. So we should assume this is by far the most common timeout.
bq. you don't see a complete halt until capacity's worth of requests timeout all at once, because you don't get an entire capacity load accepted at once. it's more continuous than discrete – you pause until the oldest expire, accept more, pause until the oldest expire, etc. so you make slow progress as load shedding can free up memory. thus, load shedding is complementary to flow control.
You see a complete halt as soon as we exhaust space. If we exhaust space in 0.5x timeout, then we will see repeated juddering behaviour.
bq. but we can easily set a higher limit on MS heap – maybe as high as 1/8 heap as default which gives us a lot of room for 8GB heap
If we set this really _aggressively_ high, say min(1/4 heap, 1Gb), until we implement the improved shedding, then I'll quit complaining. Right now we give breathing room up to and beyond collapse. I absolutely agree that breathing room up until just-prior-to-collapse is preferable, but cutting our breathing room by a magnitude is reducing our availability in clusters without their opting into it. 1/4 heap is probably still leaving quite a lot of headroom we would otherwise have safely used in a 2Gb heap (which are quite feasible, and probably preferable, for many users running offheap memtables), but is still very unlikely to cause the server to completely collapse. Bound the number of in-flight requests at the coordinator - Key: CASSANDRA-9318 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 Project: Cassandra Issue Type: Improvement Reporter: Ariel Weisberg Assignee: Ariel Weisberg Fix For: 2.1.x, 2.2.x It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes. An implementation might do something like track the number of outstanding bytes and requests and if it reaches a high watermark disable read on client connections until it goes back below some low watermark. Need to make sure that disabling read on the client connection won't introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605629#comment-14605629 ] Jonathan Ellis commented on CASSANDRA-9318: --- bq. If we set this really aggressively high, say min(1/4 heap, 1Gb) until we implement the improved shedding, then I'll quit complaining. Sold! Bound the number of in-flight requests at the coordinator - Key: CASSANDRA-9318 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 Project: Cassandra Issue Type: Improvement Reporter: Ariel Weisberg Assignee: Ariel Weisberg Fix For: 2.1.x, 2.2.x It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes. An implementation might do something like track the number of outstanding bytes and requests and if it reaches a high watermark disable read on client connections until it goes back below some low watermark. Need to make sure that disabling read on the client connection won't introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605539#comment-14605539 ] Jonathan Ellis commented on CASSANDRA-9318: ---
* default timeout is 2s not 10, so actually fine in your example of 300MB vs 150MB/s x 2s
* but we can easily set a higher limit on MS heap -- maybe as high as 1/8 heap as default which gives us a *lot* of room for 8GB heap
* you don't see a complete halt until capacity's worth of requests timeout all at once, because you don't get an entire capacity load accepted at once. it's more continuous than discrete -- you pause until the oldest expire, accept more, pause until the oldest expire, etc. so you make slow progress as load shedding can free up memory. thus, load shedding is complementary to flow control.
* aggressively load shedding for outlier nodes is a good idea that we should follow up on in another ticket. again, current behavior of continuing to accept requests until we fall over is worse than imposing flow control, so we should start with that [flow control] in 2.1 and make further improvements in 2.2.
Bound the number of in-flight requests at the coordinator - Key: CASSANDRA-9318 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 Project: Cassandra Issue Type: Improvement Reporter: Ariel Weisberg Assignee: Ariel Weisberg Fix For: 2.1.x, 2.2.x It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes. An implementation might do something like track the number of outstanding bytes and requests and if it reaches a high watermark disable read on client connections until it goes back below some low watermark. Need to make sure that disabling read on the client connection won't introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605779#comment-14605779 ] Benedict commented on CASSANDRA-9318: - This still leaves some questions, of varying import and difficulty:
* how can we easily block consumption of writes from clients without stopping reads?
* how should we mandate (or encourage) native clients to behave:
** separate read/write connections?
** configurable blocking queue size for pending writes to avoid unbounded growth?
** should timeouts be from time of despatch, or from time of submission to local client?
* will we implement this for thrift clients?
Bound the number of in-flight requests at the coordinator - Key: CASSANDRA-9318 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 Project: Cassandra Issue Type: Improvement Reporter: Ariel Weisberg Assignee: Ariel Weisberg Fix For: 2.1.x, 2.2.x It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes. An implementation might do something like track the number of outstanding bytes and requests and if it reaches a high watermark disable read on client connections until it goes back below some low watermark. Need to make sure that disabling read on the client connection won't introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605814#comment-14605814 ] Benedict commented on CASSANDRA-9318: - Do you mean within a single connection? I was under the impression we absolutely didn't want to stop readers? I disagree about not imposing _expectations_ on client implementations, so that users don't need to reason about each connector they use independently. Along with the protocol spec, other specs such as behaviour in these scenarios should really be spelled out. If clients want to do their own thing, there's not much we can do, but it's helpful for users to have an expectation of behaviour, and for implementors to be made aware of the potential problems their driver may encounter that they do not anticipate (and what everyone will expect them to do). Bound the number of in-flight requests at the coordinator - Key: CASSANDRA-9318 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 Project: Cassandra Issue Type: Improvement Reporter: Ariel Weisberg Assignee: Ariel Weisberg Fix For: 2.1.x, 2.2.x It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes. An implementation might do something like track the number of outstanding bytes and requests and if it reaches a high watermark disable read on client connections until it goes back below some low watermark. Need to make sure that disabling read on the client connection won't introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605808#comment-14605808 ] Jonathan Ellis commented on CASSANDRA-9318: ---
# We can't and we shouldn't
# See Above
# No
Bound the number of in-flight requests at the coordinator - Key: CASSANDRA-9318 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 Project: Cassandra Issue Type: Improvement Reporter: Ariel Weisberg Assignee: Ariel Weisberg Fix For: 2.1.x, 2.2.x It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes. An implementation might do something like track the number of outstanding bytes and requests and if it reaches a high watermark disable read on client connections until it goes back below some low watermark. Need to make sure that disabling read on the client connection won't introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605817#comment-14605817 ] Jonathan Ellis commented on CASSANDRA-9318: --- https://issues.apache.org/jira/browse/CASSANDRA-9318?focusedCommentId=14604775page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14604775 Bound the number of in-flight requests at the coordinator - Key: CASSANDRA-9318 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 Project: Cassandra Issue Type: Improvement Reporter: Ariel Weisberg Assignee: Ariel Weisberg Fix For: 2.1.x, 2.2.x It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes. An implementation might do something like track the number of outstanding bytes and requests and if it reaches a high watermark disable read on client connections until it goes back below some low watermark. Need to make sure that disabling read on the client connection won't introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605825#comment-14605825 ] Benedict commented on CASSANDRA-9318: - My mistake, I clearly misread your earlier comment. I'm not sure I agree with that conclusion, but not strongly enough to prolong the discussion. So I guess that's that then. Bound the number of in-flight requests at the coordinator - Key: CASSANDRA-9318 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 Project: Cassandra Issue Type: Improvement Reporter: Ariel Weisberg Assignee: Ariel Weisberg Fix For: 2.1.x, 2.2.x It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes. An implementation might do something like track the number of outstanding bytes and requests and if it reaches a high watermark disable read on client connections until it goes back below some low watermark. Need to make sure that disabling read on the client connection won't introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605345#comment-14605345 ] Benedict commented on CASSANDRA-9318: -
bq. I think you're \[Benedict\] overselling how scary it is to stop reading new requests until we can free up some memory from MS.
The problem is that freeing up memory can be constrained by one or a handful of _dead_ nodes. We can not only stop accepting work, but significantly reduce cluster throughput as a result of a *single* timed-out node. I'm not overselling anything, although I may have a different risk analysis than you do. Take a simple mathematical thought experiment: we have a four node cluster (pretty common), with RF=3, serving 100kop/s per coordinator; these operations in memory occupy around 2K as a Mutation (again, pretty typical). Ordinary round-trip time is 10ms (also, pretty typical). So, under normal load we would see around 2Mb of data maintained for our queries in-flight across the cluster. But now one node goes down. This node is a peer for 3/4 of all writes to the cluster, so we see 150Mb/s of data accumulate in each coordinator. Our limit is probably no more than 300Mb (probably lower). Our timeout is 10s. So we now have 8s during which nothing can be done, across the cluster, due to one node's death. After that 8s has elapsed, we get another flurry. Then another 8s of nothing. Even with a CL of ONE. This really is fundamentally opposed to the whole idea of Cassandra, and I cannot emphasize enough how much I am against it except as a literal last resort when all other strategies have failed.
bq. Hinting is better than leaving things in an unknown state but it's not something we should opt users into if we have a better option, since it basically turns the write into CL.ANY.
I was under the impression we had moved to talking about ACK'd writes. I'm not suggesting we ack with success to the handler. What we do with unack'd writes is actually less important, and we have much freer rein with them. We could throw OE. We could block, as you suggest, since these should be more evenly distributed. However I would prefer we do both, i.e., when we run out of room in the coordinator, we should look to see if there are any nodes that have well in excess of their fair share of entries waiting for a response. Let's call these nodes N.
# if N=0, we block consumption of new writes, as you propose.
# otherwise, we first evict those that have been ACK'd to the client and can be safely hinted (and hint them)
# if this isn't enough, we evict handlers that, if all N were to fail, would break the CL we are waiting on, and we throw OE
Step 3 is necessary both for CL.ALL, and the scenario where 2 failing nodes have significant overlap. Bound the number of in-flight requests at the coordinator - Key: CASSANDRA-9318 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 Project: Cassandra Issue Type: Improvement Reporter: Ariel Weisberg Assignee: Ariel Weisberg Fix For: 2.1.x, 2.2.x It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes. An implementation might do something like track the number of outstanding bytes and requests and if it reaches a high watermark disable read on client connections until it goes back below some low watermark. Need to make sure that disabling read on the client connection won't introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
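[Editorial note] The arithmetic behind the thought experiment above can be checked with a quick back-of-the-envelope calculation; the figures are the ones stated in the comment, and the snippet simply works them out.
{code:java}
public class InflightMathSketch
{
    public static void main(String[] args)
    {
        double opsPerSec = 100_000;      // per coordinator
        double mutationBytes = 2 * 1024; // ~2K per Mutation on heap
        double rttSeconds = 0.010;       // ordinary round-trip time

        // Steady state: in-flight data is rate * size * RTT, roughly 2 MB.
        double steadyStateBytes = opsPerSec * mutationBytes * rttSeconds;

        // One node down in a 4-node, RF=3 cluster: it is a replica for ~3/4 of writes,
        // so un-acked mutations accumulate at roughly 150 MB/s per coordinator.
        double accumulationPerSec = opsPerSec * 0.75 * mutationBytes;

        // With a ~300 MB in-flight limit and a 10s timeout, the limit fills in ~2s,
        // leaving ~8s stalled until the oldest requests expire.
        double secondsToFillLimit = 300 * 1024 * 1024 / accumulationPerSec;

        System.out.printf("steady state: %.1f MB, accumulation: %.1f MB/s, limit fills in %.1f s%n",
                          steadyStateBytes / (1024 * 1024),
                          accumulationPerSec / (1024 * 1024),
                          secondsToFillLimit);
    }
}
{code}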
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604662#comment-14604662 ] Benedict commented on CASSANDRA-9318:
-

I'm pretty sure I've made clear a few times that I'm proposing load shedding based on _both_ resource consumption and timeout; i.e., if we are running short of resources, we hint, and if we completely run out of resources, we shed. In this case, shedding is _never_ incapable of keeping us in a happy place, and it ensures we absolutely prevent any spam from bringing down the server.

I think we really need to separate the two concerns, as we seem to be jumping between them: keeping the server alive is best done through shedding; helping users with bulk loaders is best served by pausing single clients that are exceeding our rate of consumption.
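A minimal sketch of the two-tier behaviour described here (hint past a soft limit, shed past a hard limit). The {{InFlightTracker}} class, the threshold values, and the {{Action}} result are assumptions made up for this example; the real decision would of course hook into MessagingService's own accounting.

{code:java}
// Hypothetical two-tier load shedding: hint above a soft limit, shed above a hard limit.
import java.util.concurrent.atomic.AtomicLong;

final class InFlightTracker
{
    private static final long SOFT_LIMIT_BYTES = 200L << 20; // start hinting above ~200MB in flight (illustrative)
    private static final long HARD_LIMIT_BYTES = 300L << 20; // shed outright above ~300MB (illustrative)

    private final AtomicLong inFlightBytes = new AtomicLong();

    enum Action { SEND, HINT, SHED }

    /** Decide what to do with a mutation of the given serialized size. */
    Action admit(long mutationSize)
    {
        long total = inFlightBytes.addAndGet(mutationSize);
        if (total > HARD_LIMIT_BYTES)
            return Action.SHED;  // completely out of room: drop and rely on timeout/retry semantics
        if (total > SOFT_LIMIT_BYTES)
            return Action.HINT;  // running short: write a hint instead of holding the request in memory
        return Action.SEND;      // normal path: dispatch to replicas
    }

    /** Called when a response arrives, the request times out, or it is hinted/shed. */
    void release(long mutationSize)
    {
        inFlightBytes.addAndGet(-mutationSize);
    }
}
{code}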
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604649#comment-14604649 ] Jonathan Ellis commented on CASSANDRA-9318:
---

Here's where I've ended up:
# Continuing to accept writes faster than a coordinator can deliver them to replicas is bad. Even perfect load shedding is worse from a client perspective than throttling, since if we load shed and time out, the client has to guess the right rate to retry at.
# For the same reason, accepting a write but then refusing it with UnavailableException is worse than waiting to accept the write until we have capacity for it.
# It's more important to throttle writes because, while we can get in trouble with large reads too (a small request turns into a big reply), in practice reads are naturally throttled: a client needs to wait for the read before taking action on it. With writes, on the other hand, a new user's first inclination is to see how fast s/he can bulk load stuff.

In practice, I see load shedding and throttling as complementary. Replicas can continue to rely on load shedding. Perhaps we can attempt distributed back pressure later (if every replica is overloaded, we should again throttle clients), but for now let's narrow our scope to throttling clients to the capacity of a coordinator to send out.

I propose we define a limit on the amount of memory MessagingService can consume and pause reading additional requests whenever that limit is hit (a sketch follows below). Note that:
# If MS's load is distributed evenly across all destinations, then this is trivially the right thing to do.
# If MS's load is caused by a single replica falling over or unable to keep up, this is still the right thing to do, because the alternative is worse. MS will load shed timed-out requests, but if clients are sending more requests to a single replica than we can shed (i.e., if rate * timeout > capacity), then we still need to throttle or we will exhaust the heap and fall over.

(The hint-based UnavailableException tries to help with scenario 2, and I will open a ticket to test how well that actually works. But the hint threshold cannot help with scenario 1 at all, and that is the hole this ticket needs to plug.)
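A rough sketch of the coordinator-side throttle proposed here, assuming a Netty-style transport where reads can be paused per connection by toggling {{autoRead}}: track the bytes MessagingService is holding against a high/low watermark and stop reading client requests while above the high watermark. The {{MessagingReservation}} class and the watermark values are hypothetical, chosen only to show the shape of the idea, not the committed implementation.

{code:java}
// Hypothetical high/low watermark throttle: stop reading client requests while
// MessagingService-owned memory is above the high watermark, resume below the low one.
import io.netty.channel.Channel;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class MessagingReservation
{
    private static final long HIGH_WATERMARK = 256L << 20; // pause client reads above ~256MB (illustrative)
    private static final long LOW_WATERMARK  = 192L << 20; // resume once back under ~192MB (illustrative)

    private final AtomicLong outstandingBytes = new AtomicLong();
    private final Set<Channel> clientChannels = ConcurrentHashMap.newKeySet();

    void register(Channel clientChannel)
    {
        clientChannels.add(clientChannel);
    }

    /** Called when a request is handed to MessagingService for delivery to replicas. */
    void acquire(long bytes)
    {
        if (outstandingBytes.addAndGet(bytes) >= HIGH_WATERMARK)
            setAutoRead(false); // stop pulling new requests off every client socket
    }

    /** Called when a replica responds, the request times out, or it is hinted/shed. */
    void release(long bytes)
    {
        if (outstandingBytes.addAndGet(-bytes) <= LOW_WATERMARK)
            setAutoRead(true);  // safe to start reading client requests again
    }

    private void setAutoRead(boolean enabled)
    {
        for (Channel ch : clientChannels)
            ch.config().setAutoRead(enabled);
    }
}
{code}

The gap between the two watermarks is what prevents the throttle from flapping: reads only resume once a meaningful amount of memory has actually been reclaimed.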
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604656#comment-14604656 ] Benedict commented on CASSANDRA-9318:
-

bq. If MS's load is caused by a single replica falling over or unable to keep up, this is still the right thing to do... or we will exhaust the heap and fall over.

How does load shedding (or immediately hinting) not prevent this scenario? The proposal you're making appreciably harms our availability guarantees. Load shedding and/or hinting does not, and it fulfils this most important criterion.

If we pause accepting requests _from a single client_ once that client is using in excess of a lower watermark (based on some fair share of available memory in the MS), and _only that client_ is affected, I think that is an acceptably constrained loss of availability. Enforcing it globally seems to me to harm our most central USP far too significantly.
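For contrast with the global throttle sketched earlier, a rough sketch of the per-client variant suggested here: only the connection whose in-flight bytes exceed its fair share of the MessagingService budget gets paused. The {{PerClientThrottle}} class, the fair-share formula, and the resume point are assumptions for illustration only.

{code:java}
// Hypothetical per-client watermark: only the client exceeding its fair share of the budget is paused.
import io.netty.channel.Channel;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class PerClientThrottle
{
    private static final long MS_BUDGET_BYTES = 256L << 20; // total in-flight memory budget (illustrative)

    private final Map<Channel, AtomicLong> perClientBytes = new ConcurrentHashMap<>();

    void acquire(Channel client, long bytes)
    {
        AtomicLong counter = perClientBytes.computeIfAbsent(client, c -> new AtomicLong());
        // Fair share = total budget divided by the number of currently tracked clients.
        long fairShare = MS_BUDGET_BYTES / Math.max(1, perClientBytes.size());
        if (counter.addAndGet(bytes) > fairShare)
            client.config().setAutoRead(false); // pause reads from this client only
    }

    void release(Channel client, long bytes)
    {
        AtomicLong counter = perClientBytes.get(client);
        long fairShare = MS_BUDGET_BYTES / Math.max(1, perClientBytes.size());
        if (counter != null && counter.addAndGet(-bytes) <= fairShare / 2) // resume below half of fair share
            client.config().setAutoRead(true);
    }
}
{code}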
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604697#comment-14604697 ] Jonathan Ellis commented on CASSANDRA-9318:
---

# You can't just shed indiscriminately without breaking the contract that once you've accepted a write (and not handed back UnavailableException) you must deliver it or hint it. You can't just drop it on the floor, even to save yourself from falling over. (If you fall over, then at least it's clear to the user that you weren't able to fulfill your contract. No, logging it isn't good enough.)
# Again, shedding is strictly worse from a user's point of view than not accepting a write we can't handle.