[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356800#comment-15356800 ]

Stefania commented on CASSANDRA-9318:
-------------------------------------

Well, I read all of the discussion above and it perhaps took longer than
actually reviewing the code! :) If I understood correctly, what we are trying to
achieve with this new approach is to reduce the number of dropped mutations
rather than to prevent OOM, but in a way that doesn't cause OOM by invoking a
rate-limiter acquire in a Netty shared worker pool thread, which would slow down
our ability to handle new client requests, as well as throw OEs once in a while
when a low water mark is reached. Is this a fair description? Assuming we can
prove this in a larger scenario, I am +1 on the approach, but I would think it
should go to trunk rather than 3.0? In the CCM test performed with two nodes,
what was the difference in write timeouts?
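
Just to make the concern about the alternative concrete: a blocking rate-limiter
acquire on a shared worker thread stalls every request served by that thread,
whereas a bounded, non-blocking check lets the caller shed load instead. A
minimal sketch using Guava's RateLimiter, purely for illustration and not taken
from the patch (the rate and timeout values are made up):

{code:java}
import java.util.concurrent.TimeUnit;

import com.google.common.util.concurrent.RateLimiter;

public class BackPressureSketch
{
    // Illustrative, fixed rate; not how the patch derives its limit.
    private final RateLimiter limiter = RateLimiter.create(1000.0);

    // Problematic on a shared Netty worker / request thread: acquire() parks the
    // thread until a permit is available, delaying unrelated client requests.
    void blockingSend(Runnable sendMutation)
    {
        limiter.acquire();
        sendMutation.run();
    }

    // Bounded alternative: wait at most a few milliseconds for a permit and let
    // the caller decide what to do otherwise (drop, retry later, report overload).
    boolean boundedSend(Runnable sendMutation)
    {
        if (limiter.tryAcquire(10, TimeUnit.MILLISECONDS))
        {
            sendMutation.run();
            return true;
        }
        return false;
    }
}
{code}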

In terms of testing, here is what I would suggest (cc [~cassandra-te] for any 
help they may be able to provide):

* verify that we can reduce the number of dropped mutations in a larger (5-10
node) cluster with multiple clients writing simultaneously (see the sketch after
this list for one way to generate such a load)
* some cstar perf tests to ensure ops per second are not degraded, for both
reads and writes
* the dtests should be run with and without backpressure enabled
* we should do a bulk load test, for example with cqlsh COPY FROM
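
For the first point, if cassandra-stress is not a good fit, a crude multi-client
write generator could look roughly like the following; this is only an
illustrative sketch based on the DataStax Java driver, and the contact point,
keyspace, table and thread/row counts are placeholders:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Fires unthrottled async writes from several client threads so that dropped
// mutation counts can be compared with backpressure enabled and disabled.
public class ConcurrentWriteLoad
{
    public static void main(String[] args) throws InterruptedException
    {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("ks"))
        {
            ExecutorService clients = Executors.newFixedThreadPool(8);
            for (int c = 0; c < 8; c++)
            {
                final int clientId = c;
                clients.submit(() ->
                {
                    // Fire-and-forget on purpose: the goal is to overload the cluster.
                    for (int i = 0; i < 100000; i++)
                        session.executeAsync("INSERT INTO t (id, c, v) VALUES (" + clientId + ", " + i + ", 'x')");
                });
            }
            clients.shutdown();
            clients.awaitTermination(1, TimeUnit.HOURS);
        }
    }
}
{code}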

As for the code review, I've addressed some nits, mostly comments and trailing
spaces, directly
[here|https://github.com/stef1927/cassandra/commit/334509c10a5f4f759550fb64902aabc8c4a4f40c].
The code is really of excellent quality and very well unit tested, so I didn't
find much, and the points below are mostly suggestions. I will, however, have
another pass next week, now that I am more familiar with the feature.

h5. conf/cassandra.yaml

I've edited the comments slightly, mostly to spell out for users why they may
need a back pressure strategy, and to put the emphasis on the default
implementation first before explaining how to create new strategies. As you
mentioned, new strategies may require new state parameters anyway, so it's not
clear how easy adding new strategies will be. You may be able to improve on my
comments further.

Should we perhaps group all settings under something like back_pressure_options?

back_pressure_timeout_override: should we call it what it is, the
back_pressure_window_size, and recommend in the comments that users increase the
write timeout, or is an override of the write timeout necessary for correctness?

h5. src/java/org/apache/cassandra/config/Config.java

I moved the creation of the default parameterized class to
RateBasedBackPressure, to avoid polluting Config with constants.
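
That is, something along these lines, keeping the default definition next to the
strategy itself; the factory method name and the parameter names/values below
are placeholders rather than what the patch actually uses:

{code:java}
package org.apache.cassandra.net;

import com.google.common.collect.ImmutableMap;

import org.apache.cassandra.config.ParameterizedClass;

public class RateBasedBackPressure // only the default-params factory is sketched
{
    public static ParameterizedClass withDefaultParams()
    {
        // Placeholder parameter names and values.
        return new ParameterizedClass(RateBasedBackPressure.class.getName(),
                                      ImmutableMap.of("high_ratio", "0.90", "factor", "5"));
    }
}
{code}

Config.java then only needs to reference the factory for its default value, e.g.
{{back_pressure_strategy = RateBasedBackPressure.withDefaultParams();}} (field
name assumed).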

h5. src/java/org/apache/cassandra/net/BackPressureState.java

Should we use a dedicated package? It would require a couple of setters and
getters in the state to keep the properties non-public, which I would prefer
anyway.
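
Roughly along these lines, just to show the shape I mean; the package name,
fields and update methods below are illustrative, not the patch's actual
members:

{code:java}
package org.apache.cassandra.net.backpressure; // hypothetical dedicated package

// Per-host back pressure state kept encapsulated behind accessors.
public class BackPressureState
{
    private volatile double outgoingRate;
    private volatile double incomingRate;

    public double outgoingRate() { return outgoingRate; }
    public double incomingRate() { return incomingRate; }

    void onMessageSent()      { /* update outgoing rate */ }
    void onResponseReceived() { /* update incoming rate */ }
}
{code}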

h5. src/java/org/apache/cassandra/net/BackPressureStrategy.java

As above, should we use a dedicated package?

h5. src/java/org/apache/cassandra/net/RateBasedBackPressure.java

OK - I've changed the code slightly to calculate backPressureWindowHalfSize in 
seconds only once per call.

h5. src/java/org/apache/cassandra/utils/SystemTimeSource.java

Depending on the precision needed, there is an approximate time service in trunk
with a precision of 10 ms, which is faster than calling
System.currentTimeMillis().
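
For reference, a system-clock-backed time source boils down to something like
the sketch below (the method names are assumptions about the patch's interface);
the approximate clock in trunk could be swapped in behind the same interface
wherever 10 ms precision is acceptable:

{code:java}
package org.apache.cassandra.utils;

// A pluggable time source; an approximate/cached clock with ~10 ms precision
// could implement the same interface where that precision is acceptable.
public interface TimeSource
{
    long currentTimeMillis();
    long nanoTime();
}

class SystemTimeSource implements TimeSource
{
    @Override
    public long currentTimeMillis()
    {
        return System.currentTimeMillis();
    }

    @Override
    public long nanoTime()
    {
        return System.nanoTime();
    }
}
{code}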

h5. src/java/org/apache/cassandra/utils/TestRateLimiter.java

Is this still used?

h5. src/java/org/apache/cassandra/utils/TestTimeSource.java

I think we need to move it somewhere under the test sources.

> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths, Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Assignee: Sergio Bossa
>         Attachments: backpressure.png, limit.btm, no_backpressure.png
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.


