[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840058#comment-17840058
 ] 

Alex Petrov commented on CASSANDRA-19534:
-----------------------------------------

The main change is the introduction of a (currently implicit) configurable 
{_}native request deadline{_}. No request, read or write, is allowed to 
prolong its execution beyond this deadline. Some of the hidden places that 
previously allowed requests to stay overdue were local executor runnables, 
replica-side writes, and hints. The default is 12 seconds, since this is how 
long the 3.x driver (which I believe is still the most used version in the 
community) waits before removing its handlers, after which any response from 
the server is simply ignored. There is now an _option_ to enable expiration 
based on queue time, which is _disabled_ by default to preserve existing 
semantics, but my tests have shown that enabling it only has positive effects. 
We will try it out cautiously in different clusters over the next months and 
see whether the test results match up with real loads before we change any of 
the defaults.

So, by default, the behaviour will be as follows:
 # If a request has spent more than 12 seconds in the NATIVE queue, we throw an 
Overloaded exception back to the client. Previously, this timeout was the 
maximum of the read/write/range/counter RPC timeouts.
 # If a request has spent less than 12 seconds, it is allowed to execute; any 
request issued by the coordinator can live:
 ## _either_ for {{Verb.timeout}} milliseconds,
 ## _or_ up to the native request deadline, as measured from the time the 
request was admitted to the coordinator's NATIVE queue, whichever of the two 
happens earlier (see the sketch below).
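
A minimal sketch of that rule, assuming the default {{cql_start_time}} of 
{{REQUEST}}; the class and method names are illustrative, not the patch's 
actual API:
{code:java}
import java.util.concurrent.TimeUnit;

// Illustrative only: not the actual classes or method names from the patch.
public final class DeadlineSketch
{
    // Default native request deadline discussed above (12 seconds).
    static final long NATIVE_REQUEST_DEADLINE_NANOS = TimeUnit.SECONDS.toNanos(12);

    /**
     * A coordinator-issued request may live for Verb.timeout (counted here from the
     * start of processing, i.e. the default cql_start_time of REQUEST), but never past
     * the native request deadline measured from NATIVE queue admission; whichever
     * deadline comes first wins.
     */
    static long effectiveDeadlineNanos(long queueAdmissionNanos,
                                       long processingStartNanos,
                                       long verbTimeoutNanos)
    {
        long verbDeadline = processingStartNanos + verbTimeoutNanos;
        long nativeDeadline = queueAdmissionNanos + NATIVE_REQUEST_DEADLINE_NANOS;
        return Math.min(verbDeadline, nativeDeadline);
    }
}
{code}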

Example 1, read timeout is 5 seconds:
 # Client sends a request; request spends 6 seconds in the NATIVE queue
 # Coordinator issues requests to replicas; two replicas respond within 3 
seconds
 # Coordinator responds to the client with success

Example 2, read timeout is 5 seconds:
 # Client sends a request; request spends 6 seconds in the NATIVE queue
 # Coordinator issues requests to replicas; one replica responds within 3 
seconds; the other replicas fail to respond within the 5-second read timeout
 # Coordinator responds to the client with a read timeout (preserves current 
behaviour)

Example 3, read timeout is 5 seconds:
 # Client sends a request; request spends 10 seconds in the NATIVE queue
 # Coordinator issues requests to replicas; all replicas fail to respond within 
the remaining 2 seconds
 # Coordinator responds to the client with a read timeout; if messages are 
still queued on the replicas, they will be dropped before processing

There will be a _new_ metric that shows how many of the timeouts would 
previously have been “blind timeouts”, i.e. the client _would_ register them as 
timeouts, but we as server-side operators would be oblivious to them. This 
metric will keep us collectively motivated even if we see a slight uptick in 
timeouts after committing the patch.

Lastly, there is an option to control how much of the 12 seconds client 
requests are allowed to spend in the native queue. For example, you can say 
that if a client request has spent 80% of its maximum 12 seconds in the native 
queue, we start applying backpressure to the client socket (or throwing an 
Overloaded exception, depending on the value of 
{{native_transport_throw_on_overload}}). We have to be careful with enabling 
this one: my tests have shown that while we see fewer timeouts server-side, 
clients see more timeouts, because part of the time they consider “request 
time” is now spent somewhere in TCP queues, which we cannot account for.
h3. New Configuration Params
h3. cql_start_time

Configures what is considered the base for the replica-side timeout. This 
option existed before; it is now actually safe to enable. It still defaults to 
{{REQUEST}} (processing start time is taken as the timeout base), and the 
alternative is {{QUEUE}} (queue admission time is taken as the timeout base). 
Unfortunately, there is no consistent view of the timeout base in the 
community: some people think that server-side read/write timeouts are how much 
time the _replicas_ have to respond to the coordinator; others believe they 
mean how much time the _coordinator_ has to respond to the client. This patch 
is agnostic to these beliefs.
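
A minimal sketch of what the option changes, under the assumption that it only 
selects which timestamp is used as the timeout base (illustrative names):
{code:java}
// Illustrative only: cql_start_time selects the timestamp the replica-side timeout is
// counted from; this is not the patch's actual API.
public final class StartTimeSketch
{
    enum CQLStartTime { REQUEST, QUEUE }

    static long timeoutBaseNanos(CQLStartTime startTime,
                                 long queueAdmissionNanos,
                                 long processingStartNanos)
    {
        // REQUEST (default): the timeout is counted from the start of processing.
        // QUEUE: the timeout is counted from admission to the NATIVE queue.
        return startTime == CQLStartTime.QUEUE ? queueAdmissionNanos : processingStartNanos;
    }
}
{code}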
h3. native_transport_throw_on_overload

Whether we should apply backpressure to the client (i.e. stop reading from the 
socket) or throw an Overloaded exception. The default is socket backpressure, 
which is probably fine for now. In principle, this can also be set by the 
client on a per-connection basis via protocol options. However, the 3.x series 
of the driver does not implement this addition, so in practice it is not really 
used. If it is used, the setting from the client takes precedence.
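
A tiny sketch of that precedence, with the caveat that the option and type 
names here are assumptions for illustration:
{code:java}
// Illustrative only: the per-connection protocol option, when a driver actually sets
// it, takes precedence over the cassandra.yaml default, as described above.
public final class OverloadActionSketch
{
    enum OverloadAction { PAUSE_SOCKET, THROW_OVERLOADED }

    static OverloadAction onOverload(Boolean perConnectionThrowOnOverload, boolean yamlThrowOnOverload)
    {
        boolean throwOnOverload = perConnectionThrowOnOverload != null
                                  ? perConnectionThrowOnOverload
                                  : yamlThrowOnOverload;
        return throwOnOverload ? OverloadAction.THROW_OVERLOADED : OverloadAction.PAUSE_SOCKET;
    }
}
{code}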
h3. native_transport_timeout_in_ms

The absolute maximum amount of time the server has to respond to the client. 
No work related to a client request will be done after that period elapses. 
The default is 100 seconds, which is unreasonably high but not unbounded. In 
practice, we should use at most 12 seconds.
h3. native_transport_queue_max_item_age_threshold

Works in conjunction with {{native_transport_timeout_in_ms}}. Controls how 
much of that maximum time the oldest request in the native queue is allowed to 
spend in the queue before we start applying backpressure (or throwing an 
Overloaded exception, depending on what {{native_transport_throw_on_overload}} 
is set to). The default is “all the time needed”. We should eventually set it 
to 0.5-0.8 (50-80% of the 12 seconds), but I would leave it disabled at least 
until more testing is done, since it hides the time a request spent in TCP 
queues before it got to us.
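
As a sketch, the check amounts to comparing the age of the oldest queued 
request against that fraction of {{native_transport_timeout_in_ms}} 
(illustrative names, not the patch's actual API):
{code:java}
import java.util.concurrent.TimeUnit;

// Illustrative only: compares the oldest queued request's age against the configured
// fraction of native_transport_timeout_in_ms.
public final class QueueAgeThresholdSketch
{
    static boolean overThreshold(long oldestItemAgeNanos, long nativeTransportTimeoutMillis, double ageThreshold)
    {
        long limitNanos = (long) (TimeUnit.MILLISECONDS.toNanos(nativeTransportTimeoutMillis) * ageThreshold);
        return oldestItemAgeNanos > limitNanos;
    }

    // For example, overThreshold(age, 12_000, 0.8) would start applying backpressure
    // once the oldest request has been queued for more than ~9.6 seconds.
}
{code}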
h3. native_transport_\{min|max}_backoff_on_queue_overload_ms

If we start applying backpressure (see {{native_transport_timeout_in_ms}}), 
these set the minimum and maximum backoff delays. If the queue fills up above 
{{native_transport_queue_max_item_age_threshold}}, in other words a request 
has been sitting in the queue for more than that fraction of 
{{native_transport_timeout_in_ms}}, we start an incident.

An incident starts by marking the incident time and raising the severity level 
to 1. Each time we observe an old item at the head of the queue, we first bump 
the number of times we have applied backpressure. After bumping it 10 times, 
we increase the severity level by one.

The backpressure delay applied to the client socket is computed by multiplying 
the severity level by the minimum delay.

If we have not seen old requests at the head of the queue for 1 second, we 
close the incident.

If the queue remains saturated for a prolonged period, the amount of delay 
will increase in proportion to the request rate as appliedTimes and 
severityLevel are incremented. If no new requests are considered overloaded in 
this way for a second, the incident is reset and the delay drops back down to 
the minimum delay.
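
A hedged sketch of that bookkeeping, under a few assumptions: the severity step 
is one, a new incident is opened lazily when an overloaded item is seen after a 
quiet second, and the delay is capped at the maximum backoff (inferred from the 
\{min|max} parameter pair). None of these names are the patch's actual API:
{code:java}
import java.util.concurrent.TimeUnit;

// Illustrative only: tracks an "incident" while old items keep appearing at the head of
// the NATIVE queue, and derives the backpressure delay from the severity level.
public final class QueueBackpressureSketch
{
    static final long QUIET_PERIOD_NANOS = TimeUnit.SECONDS.toNanos(1);

    final long minBackoffMillis;
    final long maxBackoffMillis;

    long incidentStartNanos = Long.MIN_VALUE;
    long lastOverloadNanos = Long.MIN_VALUE;
    int severityLevel = 0;   // 0 means "no open incident"
    int appliedTimes = 0;

    QueueBackpressureSketch(long minBackoffMillis, long maxBackoffMillis)
    {
        this.minBackoffMillis = minBackoffMillis;
        this.maxBackoffMillis = maxBackoffMillis;
    }

    /** Called when the item at the head of the NATIVE queue is older than the threshold;
     *  returns the delay (in milliseconds) to apply to the client socket. */
    long onOverloadedItem(long nowNanos)
    {
        if (severityLevel == 0 || nowNanos - lastOverloadNanos > QUIET_PERIOD_NANOS)
        {
            // Open (or re-open) the incident: record the time and start at severity 1.
            incidentStartNanos = nowNanos;
            severityLevel = 1;
            appliedTimes = 0;
        }

        lastOverloadNanos = nowNanos;
        appliedTimes++;
        if (appliedTimes % 10 == 0)
            severityLevel++; // every 10 applications of backpressure, raise severity by one

        // Delay is severity multiplied by the minimum backoff, capped at the maximum (assumed).
        return Math.min(severityLevel * minBackoffMillis, maxBackoffMillis);
    }
}
{code}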
h2. Newly Introduced Metrics
 * {{ClientMetrics/ConnectionPaused}} - number of times a client connection was 
paused due to backpressure.
 * {{ThreadPoolMetrics/OldestTaskQueueTime}} - the age (in milliseconds) of the 
oldest task in the given queue.
 * {{ClientMetrics/Queued}} - how long (in nanoseconds) an item was queued 
before processing.

h2. Testing and Examples

Scenario 1: constant overload
{code:java}
             Stock | QUEUE | QUEUE + backpressure
success    | 595   | 56179 | 21540
timedOut   | 101180| 45647 | 80310
overloaded | 66    | 0     | 0 {code}
Scenario 2: sudden burst, followed by a drop
{code:java}
             Stock | QUEUE  | QUEUE+backpressure
success    | 46305 | 41087  | 46021
timedOut   | 5016  | 10121  | 5204
overloaded | 0     | 5204   | 0 {code}
How to interpret the results: an overloaded cluster eventually falls into a 
state where it does not serve any good traffic at all (in real life it would 
continue serving some traffic, but clients would experience a substantial 
number of timeouts). It is very busy and spinning on all gears; all queues are 
full. However, because the timeout base is “when we began processing the 
request” (disregarding the potentially long queue time), we serve almost no 
good traffic. Above, you can see that the number of successes in the “Stock” 
column is very small. This is the number of requests that we were able to 
complete successfully after introducing 50ms of artificial latency in the READ 
verb handler.

During overload, we now instead shed bad (timed-out) requests and do not 
attempt to serve them. In addition, we apply backpressure to the clients to 
try to push them back a little.

While in a cluster under constant overload pressure this is just “the best we 
can do”, in real life such events are rare and short, so all we need to do is 
prevent a sudden short burst from sticking around long enough for the cluster 
to start serving less good traffic.

You can see that backpressure does not add much on top of QUEUE; server-side 
metrics will be slightly better, but on the client side we will keep timing 
out, since the client does not know whether the request was paused because of 
server-side queue delay or buffered in a TCP queue. We still believe that we 
need client backpressure (or load shedding), since in the case of a sudden 
burst of requests that eventually tapers off, backpressure does have a 
positive effect.

There may be a _slight_ change in metrics after this change, but the added 
time will be constant and very small, maybe not even significant enough for 
operators to notice. Previously, {{nanoTime}} for metric calculation was 
grabbed _in the middle of the request_ on the coordinator side. Now, we track 
the entire coordinated path, but this is just a few extra method calls on both 
the read and write paths.

> unbounded queues in native transport requests lead to node instability
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-19534
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Jon Haddad
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 5.0-rc, 5.x
>
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=100000 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=100000 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                                 Deletes                        Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 |       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 |       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 |       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 |       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 |       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.


