[jira] [Comment Edited] (CASSANDRA-7937) Apply backpressure gently when overloaded with writes
[ https://issues.apache.org/jira/browse/CASSANDRA-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328373#comment-15328373 ] Ariel Weisberg edited comment on CASSANDRA-7937 at 6/13/16 9:51 PM:

I think we can make this situation better, and I mentioned some ideas at NGCC and in CASSANDRA-11327. There are two issues.

The first is that if flushing falls behind, throughput falls to zero instead of degrading to the rate at which flushing progresses, which is usually not zero. Right now it looks like zero because flushing doesn't release any memory as it progresses; it is all or nothing. Aleksey mentioned we could do something like early opening for flushing so that memory is made available sooner. Alternatively we could overcommit and then gradually release memory as flushing progresses.

The second, and this isn't really related to backpressure, is that flushing falls behind in several reasonable configurations. Ingest has gotten faster and I don't think flushing has kept pace, so it's easier for flushing to fall behind when it's driven by a single thread against a busy device (even a fast SSD). I haven't tested this yet, but I suspect that if you use multiple JBOD paths on a fast device like an SSD and increase memtable_flush_writers, you will get enough additional flushing throughput to keep up with ingest. Right now flushing is single threaded for a single path and only one flush can occur at a time.

Flushing falling behind is more noticeable when you let compaction have more threads and a bigger rate limit, because compaction can dirty enough memory in the filesystem cache that, when it is written back, it causes a temporally localized slowdown in flushing. That slowdown is enough to cause timeouts once there is no more free memory because flushing didn't finish soon enough.

I think the long term solution is that the further flushing falls behind, the more concurrent flush threads we deploy, similar to compaction, up to the configured limit. [Right now there is a single thread scheduling flushes and waiting on the result.|https://github.com/apache/cassandra/blob/cassandra-3.7/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1130] memtable_flush_writers doesn't help because the code only [generates more flush runnables for a memtable if there are multiple directories|https://github.com/apache/cassandra/blob/cassandra-3.7/src/java/org/apache/cassandra/db/Memtable.java#L278]. C* is already divvying up the heap using memtable_cleanup_threshold, which would allow for concurrent flushing; it's just not actually flushing concurrently.
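To make that last idea concrete, here is a rough, hypothetical sketch of scaling flush concurrency with the backlog. This is not Cassandra's actual flush path; AdaptiveFlushScheduler, the backlog queue, and the concurrency formula are invented purely for illustration.

{code:java}
// Hypothetical sketch: the further flushing falls behind, the more flush tasks
// we run concurrently, capped at the configured writer limit (like compaction).
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AdaptiveFlushScheduler {
    private final ExecutorService flushExecutor;
    private final int maxFlushWriters;            // e.g. memtable_flush_writers from cassandra.yaml

    AdaptiveFlushScheduler(int maxFlushWriters) {
        this.maxFlushWriters = maxFlushWriters;
        this.flushExecutor = Executors.newFixedThreadPool(maxFlushWriters);
    }

    /** More pending memtables => more flushes in flight, capped at the configured limit. */
    int desiredConcurrency(int pendingFlushes) {
        return Math.min(maxFlushWriters, Math.max(1, pendingFlushes));
    }

    /** Submit up to desiredConcurrency flush tasks instead of scheduling one at a time. */
    void scheduleFlushes(Queue<Runnable> flushBacklog) {
        int writers = desiredConcurrency(flushBacklog.size());
        for (int i = 0; i < writers; i++) {
            Runnable flush = flushBacklog.poll();
            if (flush == null)
                break;
            flushExecutor.submit(flush);
        }
    }
}
{code}

The point of the sketch is only the shape of the policy: a single path no longer forces a single flush at a time once the backlog grows.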
[jira] [Comment Edited] (CASSANDRA-7937) Apply backpressure gently when overloaded with writes
[ https://issues.apache.org/jira/browse/CASSANDRA-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14259967#comment-14259967 ] Piotr Kołaczkowski edited comment on CASSANDRA-7937 at 12/29/14 9:58 AM:

I like the idea of using parallelism level as a backpressure mechanism. That would have the nice side effect of automatically reducing the amount of memory used for queuing requests. However, I have a few concerns:

1. My biggest concern is that even limiting a single client to one write at a time (window size = 1) might still be too fast for some fast clients if the rows are big enough, particularly when writing big cells of data, where big = hundreds of kB or single MBs per cell. Cassandra is extremely efficient at ingesting data into memtables. If ingest is still faster than we're able to flush, we still have a problem.
2. When not using a TokenAware LBP, unbalanced write load could needlessly penalize the fast (not loaded) nodes with a lower parallelism level, because the window size would be per coordinator, not per partition (replica set). Am I right here?

I wonder whether applying a penalty delay *after* processing the write request would be a better idea: in case of an unbalanced load, when one partition gets hammered but most others are fine, it would slow down writes for that one partition (replica set) without affecting the latency of writes to other partitions. I'm for applying the delay after, because by then we already know the replica set and their load, and we don't need to keep the data queued in memory, since it has already been written. The (simplified) process would look as follows (see the sketch after this list):

1. the write request gets accepted by the coordinator
2. the write gets sent to the proper replica nodes
3. the replicas that acknowledge the write also reply with their current load information
4. the received load information gets aggregated (median)
5. when the CL is satisfied but the load was high enough to be in the danger zone, the coordinator adds some artificial delay before acknowledging the write to the client - of course small enough to not exceed the write timeout. Or, if that is not possible or wise for some other reason (e.g. holding memory), it just tells the driver "load was high, you'd better slow down" and the driver waits before processing the next request.

WDYT?
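To illustrate the proposal, here is a minimal, hypothetical coordinator-side sketch of the penalty delay. The load metric, the danger threshold, and the delay formula are assumptions made only for the example; none of this exists in Cassandra today.

{code:java}
// Hypothetical sketch of the penalty-delay idea: ack is delayed only when the
// median load reported by the acking replicas is in the "danger zone", and the
// delay never pushes the total request time past the write timeout.
import java.util.Arrays;
import java.util.concurrent.TimeUnit;

class PenaltyDelay {
    static final double DANGER_THRESHOLD = 0.8;    // assumed danger-zone load level (0..1)
    static final long WRITE_TIMEOUT_MS = 2000;     // must stay below the client's write timeout

    /** Median of the load values reported by replicas that acknowledged the write. */
    static double medianLoad(double[] replicaLoads) {
        double[] sorted = replicaLoads.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Delay the acknowledgement only when the replica set is overloaded. */
    static void maybeDelayAck(double[] replicaLoads, long elapsedMs) throws InterruptedException {
        double load = medianLoad(replicaLoads);
        if (load < DANGER_THRESHOLD)
            return;                                 // healthy replica set: ack immediately
        // Scale the penalty with how far past the threshold we are, within the
        // remaining time budget before the write timeout.
        long budget = Math.max(0, WRITE_TIMEOUT_MS - elapsedMs);
        long penalty = (long) ((load - DANGER_THRESHOLD) / (1.0 - DANGER_THRESHOLD) * budget * 0.5);
        TimeUnit.MILLISECONDS.sleep(penalty);
    }
}
{code}

Because the delay is applied per replica set rather than per coordinator, only clients hammering the hot partition pay the penalty.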
[jira] [Comment Edited] (CASSANDRA-7937) Apply backpressure gently when overloaded with writes
[ https://issues.apache.org/jira/browse/CASSANDRA-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242880#comment-14242880 ] Michaël Figuière edited comment on CASSANDRA-7937 at 12/11/14 6:37 PM:

The StreamIDs, introduced in the native protocol to multiplex several pending requests over a single connection, could actually serve as a backpressure mechanism. Before protocol v2 we had just 128 IDs per connection, with drivers typically allowing just a few connections per node. This already acts as a throttling mechanism on the client side. With protocol v3 we've increased this limit, but the driver still lets the user define a maximum number of requests per host, which has the same effect.

A simple way to handle backpressure could therefore be to introduce a window (similar to a TCP window) of the currently allowed concurrent requests for each client. Just like in TCP, the window size could be included in each response header sent to the client. This window size could then be adjusted using a formula still to be defined, probably based on the load of each stage of the Cassandra architecture, the state of compaction, etc.

I agree with [~jbellis]'s point: backpressure in a distributed system like Cassandra, with a coordinator forwarding traffic to replicas, is confusing. But in practice, most recent CQL drivers now do token-aware balancing by default (since 2.0.2 in the Java Driver), which sends requests for any PreparedStatement to the replicas (the expected usage under the high-pressure conditions described here). So in this situation the backpressure information received by the client could be used properly: the client would understand it as a request to slow down for *this* particular replica, and it could therefore pick another replica. Thus we end up with a system that avoids load shedding (which is a waste of time, bandwidth and workload) and that, I believe, could behave more smoothly when the cluster is overloaded.

Note that this StreamID window could be treated as a mandatory limit or just as a hint in the protocol specification. The driver could then adjust its strategy to use it or not depending on the settings or the type of request.
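A rough sketch of how a driver might honour such a window, assuming a (currently non-existent) window-size field piggybacked on each response. The class and field names are illustrative only; this is not part of the native protocol or any existing driver.

{code:java}
// Hypothetical driver-side sketch of the proposed StreamID window: the number of
// in-flight requests per connection tracks the window size the server advertises.
import java.util.concurrent.Semaphore;

class StreamIdWindow {
    /** Semaphore.reducePermits is protected, so expose it through a tiny subclass. */
    static class AdjustableSemaphore extends Semaphore {
        AdjustableSemaphore(int permits) { super(permits); }
        void reduce(int n) { super.reducePermits(n); }
    }

    private volatile int windowSize;        // concurrent requests currently allowed by the server
    private final AdjustableSemaphore inFlight;

    StreamIdWindow(int initialWindow) {
        this.windowSize = initialWindow;
        this.inFlight = new AdjustableSemaphore(initialWindow);
    }

    /** Called before sending a request: wait for a free slot in the window. */
    void acquireSlot() throws InterruptedException {
        inFlight.acquire();
    }

    /** Called for every response; the server piggybacks its advertised window size. */
    synchronized void onResponse(int advertisedWindow) {
        int delta = advertisedWindow - windowSize;
        windowSize = advertisedWindow;
        if (delta > 0)
            inFlight.release(delta);        // server allows more concurrency
        else if (delta < 0)
            inFlight.reduce(-delta);        // server is loaded: shrink the window
        inFlight.release();                 // the completed request frees its own slot
    }
}
{code}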
[jira] [Comment Edited] (CASSANDRA-7937) Apply backpressure gently when overloaded with writes
[ https://issues.apache.org/jira/browse/CASSANDRA-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136794#comment-14136794 ] Benedict edited comment on CASSANDRA-7937 at 9/17/14 6:03 AM:

Under these workload conditions, CASSANDRA-3852 is unlikely to help; it is more for reads. CASSANDRA-6812 will help for writes, but it in no way affects the concept of load shedding or backpressure, just write latency spikes and hence timeouts.

There are some issues with load shedding by itself, though, at least as it stands in Cassandra. Take your example cluster A..E with RF=3, and let's say B _and_ C are now load shedding. The result is that 1/5th of requests queue up and block in A, D and E, which within a short window results in all requests to those servers being blocked, as all rpc threads are waiting for results that will never come. Now, with explicit back pressure (whether it be blocking consumption of network messages or signalling the coordinator that we are load shedding) we could detect this problem, start dropping those requests on the floor in A, D and E, and continue to serve other requests (see the sketch below).

FTR, we already have an accidental back pressure system built into Incoming/OutboundTcpConnection, with only negative consequences. If the remote server is blocked due to GC (or some other event), it will not consume its IncomingTcpConnection, which will cause our coordinator's queue to that node to back up. This is something we should deal with, and introducing explicit back pressure of the same kind when a node cannot cope with its load would allow us to handle both situations with the same mechanism. (We do sort of try to deal with this, but only as more messages arrive, which is unlikely to happen when our client rpc threads get filled up waiting for these responses.)
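A hypothetical sketch of the explicit variant described above: the coordinator inspects its outbound backlog per replica and sheds the request up front instead of blocking an rpc thread on responses that will never come. The queue introspection shown here is an assumption for illustration, not the actual Incoming/OutboundTcpConnection API.

{code:java}
// Hypothetical coordinator-side sketch: drop a write early when too few replicas
// look healthy, instead of letting rpc threads pile up behind overloaded nodes.
import java.util.Map;
import java.util.concurrent.BlockingQueue;

class OverloadAwareCoordinator {
    static final int MAX_BACKLOG_PER_REPLICA = 1024;   // assumed backlog threshold

    private final Map<String, BlockingQueue<Object>> outboundQueues;

    OverloadAwareCoordinator(Map<String, BlockingQueue<Object>> outboundQueues) {
        this.outboundQueues = outboundQueues;
    }

    /** Returns false (shed the request) when too many replicas appear overloaded. */
    boolean trySend(Iterable<String> replicas, Object mutation, int requiredAcks) {
        int healthy = 0;
        for (String replica : replicas) {
            BlockingQueue<Object> q = outboundQueues.get(replica);
            if (q != null && q.size() < MAX_BACKLOG_PER_REPLICA)
                healthy++;
        }
        if (healthy < requiredAcks)
            return false;                               // shed now, keep rpc threads free for other requests
        for (String replica : replicas) {
            BlockingQueue<Object> q = outboundQueues.get(replica);
            if (q != null)
                q.offer(mutation);                      // best-effort enqueue to each replica
        }
        return true;
    }
}
{code}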
[jira] [Comment Edited] (CASSANDRA-7937) Apply backpressure gently when overloaded with writes
[ https://issues.apache.org/jira/browse/CASSANDRA-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136381#comment-14136381 ] Jonathan Ellis edited comment on CASSANDRA-7937 at 9/16/14 10:30 PM:

How do you backpressure in a distributed system? Suppose a cluster has 5 nodes, A..E. Node B is overloaded and the client is talking to node A. We then have some conflicting goals:

1. Backpressuring from B to A does not help; it just pushes the problem into A's MessagingService.
2. So we really want A to make the client slow down.
3. But we do not want to restrict requests to C, D and E simply because B is overloaded. Backpressure is too blunt a tool to restrict requests to B alone.

I conclude that load shedding (dropping requests) is a better solution than backpressure for us.

Apply backpressure gently when overloaded with writes
Key: CASSANDRA-7937
URL: https://issues.apache.org/jira/browse/CASSANDRA-7937
Project: Cassandra
Issue Type: Bug
Components: Core
Environment: Cassandra 2.0
Reporter: Piotr Kołaczkowski
Labels: performance

When writing huge amounts of data into a C* cluster from analytic tools like Hadoop or Apache Spark, we often see that C* can't keep up with the load. This is because analytic tools typically write data as fast as they can, in parallel, from many nodes, and they are not artificially rate-limited, so C* is the bottleneck here. Also, increasing the number of nodes doesn't really help, because in a collocated setup this also increases the number of Hadoop/Spark nodes (writers), and although the possible write performance is higher, the problem still remains.

We observe the following behavior:
1. data is ingested at an extremely fast pace into memtables and the flush queue fills up
2. the available memory limit for memtables is reached and writes are no longer accepted
3. the application gets hit by write timeouts and retries repeatedly, in vain
4. after several failed attempts to write, the job gets aborted

Desired behaviour:
1. data is ingested at an extremely fast pace into memtables and the flush queue fills up
2. after exceeding some memtable fill threshold, C* applies adaptive rate limiting to writes - the more the buffers are filled up, the fewer writes/s are accepted, however writes still complete within the write timeout
3. thanks to the slowed down data ingestion, the flush can now finish before all the memory gets used

Of course the details of how rate limiting could be done are up for discussion. It may also be worth considering putting such logic into the driver, not the C* core, but then C* needs to expose at least the following information to the driver, so we could calculate the desired maximum data rate (a sketch of such a calculation follows below):
1. the current amount of memory available for writes before they would completely block
2. the total amount of data queued to be flushed and the flush progress (amount of data remaining to flush for the memtable currently being flushed)
3. the average flush write speed
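As a rough illustration of how a driver could turn those three metrics into a target ingest rate, assuming they were exposed: the formula and names below are invented for the example and are not part of Cassandra or any driver.

{code:java}
// Hypothetical driver-side rate calculation from the three metrics the ticket
// asks Cassandra to expose. The formula is an assumption, not a specification.
class AdaptiveWriteRate {
    /**
     * @param freeMemtableBytes memory still available for writes before they block
     * @param queuedFlushBytes  data queued to be flushed, minus flush progress
     * @param flushBytesPerSec  average flush write speed
     * @return a suggested maximum ingest rate in bytes/s for this client
     */
    static double maxIngestRate(long freeMemtableBytes, long queuedFlushBytes, double flushBytesPerSec) {
        // How long until the current flush backlog drains at the observed flush speed.
        double drainSeconds = flushBytesPerSec > 0 ? queuedFlushBytes / flushBytesPerSec : Double.MAX_VALUE;
        // Spend the remaining free memory over that drain window...
        double transientRate = drainSeconds > 0 ? freeMemtableBytes / drainSeconds : Double.MAX_VALUE;
        // ...and never exceed the sustained flush speed once the backlog is gone.
        return Math.min(flushBytesPerSec, transientRate);
    }
}
{code}

The intent is simply that the steady-state ceiling is the flush speed, while a growing backlog tightens the limit further so memory is not exhausted before flushing catches up.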