[
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jon Haddad updated CASSANDRA-19534:
-----------------------------------
Attachment: screenshot-5.png
> unbounded queues in native transport requests lead to node instability
> ----------------------------------------------------------------------
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Local Write-Read Paths
> Reporter: Jon Haddad
> Assignee: Alex Petrov
> Priority: Normal
> Fix For: 4.1.x, 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 -
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg,
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html,
> screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png,
> screenshot-5.png
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up
> in the native transport queue, and requests appear to take far longer to
> time out than the configured timeout. We should shed load much more
> aggressively and use a bounded queue for incoming work. This is especially
> evident when a resource-intensive workload is combined with a lighter one.
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100 -r 1 \
>   --workload.rows=100000 --workload.select=partition --maxrlat 100 \
>   --populate 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100 -r 1 \
>   --workload.rows=100000 --workload.select=partition --rate 200 -d 1d
> # workload 2 - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h
> {noformat}
> It also appears that requests don't time out at the configured server-side timeout:
>
> {noformat}
> Writes                              | Reads                               | Deletes                             | Errors
>  Count  Latency (p99)  1min (req/s) |  Count  Latency (p99)  1min (req/s) |  Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
> 950286       70403.93        634.77 | 789524       70442.07        426.02 |      0              0             0 | 9580484         18980.45
> 952304       70567.62         640.1 | 791072       70634.34        428.36 |      0              0             0 | 9636658         18969.54
> 953146       70767.34         640.1 | 791400       70767.76        428.36 |      0              0             0 | 9695272         18969.54
> 956833       71171.28        623.14 | 794009        71175.6        412.79 |      0              0             0 | 9749377         19002.44
> 959627       71312.58        656.93 | 795703       71349.87        435.56 |      0              0             0 | 9804907         18943.11
> {noformat}
>
> After stopping the load test altogether, it took nearly a minute before the
> requests were no longer queued.
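
For illustration, the bounded queue and more aggressive load shedding requested above could look roughly like the following Java sketch. This is not Cassandra's native transport code and not the fix attached to this ticket; the class `BoundedRequestQueue`, its methods, and the capacity and timeout values are all hypothetical. It assumes requests are rejected outright when the queue is full, and that a request which has already waited longer than its timeout is dropped instead of executed.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: none of these names come from Cassandra's code base.
final class BoundedRequestQueue {
    // A queued request plus the time at which it was accepted.
    private static final class Entry {
        final Runnable task;
        final long enqueuedAtNanos;

        Entry(Runnable task, long enqueuedAtNanos) {
            this.task = task;
            this.enqueuedAtNanos = enqueuedAtNanos;
        }
    }

    private final ArrayBlockingQueue<Entry> queue;
    private final long timeoutNanos;
    // Number of requests dropped because they waited longer than the timeout.
    private final AtomicLong expired = new AtomicLong();

    BoundedRequestQueue(int capacity, long timeout, TimeUnit unit) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.timeoutNanos = unit.toNanos(timeout);
    }

    // Non-blocking enqueue: returns false when the queue is full so the
    // caller can reject the request immediately (load shedding).
    boolean offer(Runnable task, long nowNanos) {
        return queue.offer(new Entry(task, nowNanos));
    }

    // Returns the next request that is still within its timeout, or null if
    // none is left. Requests that expired while queued are dropped, not run.
    Runnable poll(long nowNanos) {
        Entry e;
        while ((e = queue.poll()) != null) {
            if (nowNanos - e.enqueuedAtNanos <= timeoutNanos) {
                return e.task;
            }
            expired.incrementAndGet();
        }
        return null;
    }

    long expiredCount() {
        return expired.get();
    }
}
```

The clock value is passed in explicitly so the behavior is easy to test; a real caller would presumably pass `System.nanoTime()`. With a fixed capacity, the queue cannot grow to hundreds of thousands of entries, and stale requests are discarded as soon as they reach the head of the queue.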