Bursts of Thrift threads make cluster unresponsive

Dmitry Simonov Thu, 27 Jun 2019 01:52:01 -0700

Hello!

We have run into the following problem several times.


Our Cassandra cluster (5 nodes) becomes unresponsive for ~30 minutes:
- all CPUs are at 100% load (normally the load average is about 5 on a
16-core machine)
- Cassandra's thread count rises from 300 to 1300-2000; most of them are
Thrift threads in java.net.SocketInputStream.socketRead0(Native Method),
while the count of other threads doesn't increase
- some Read messages are dropped
- read latency (p99.9) increases to 20-30 seconds
- there are up to 32 active Read Tasks and 3k-6k pending Read Tasks
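
For reference, a minimal sketch of how per-pool thread counts like the ones
above could be extracted from a thread dump. It assumes the usual HotSpot
jstack text format (a quoted thread name on the header line, followed by
"at ..." stack frames); the function name count_top_frames and the way
thread names are collapsed into pool names are illustrative assumptions,
not part of Cassandra or the JDK.

```python
import re
from collections import Counter

# Header line of a thread in a jstack dump, e.g.: "Thrift:16" #123 daemon ...
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# Stack frame line, e.g.: "\tat java.net.SocketInputStream.socketRead0(Native Method)"
FRAME = re.compile(r'^\s+at\s+(?P<frame>[^\s(]+)\(')


def count_top_frames(dump_text):
    """Count threads per (pool name, top stack frame) in a jstack dump.

    The pool name is the thread name with a trailing numeric suffix removed,
    so "Thrift:16" and "Thrift:17" are both counted under "Thrift"
    (an assumption about the naming scheme).
    """
    counts = Counter()
    pool = None
    top_frame_seen = False
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            pool = re.sub(r'[-:#\s]*\d+$', '', header.group('name'))
            top_frame_seen = False
            continue
        frame = FRAME.match(line)
        if frame and pool is not None and not top_frame_seen:
            # Only the first "at ..." line after a header is the top frame.
            counts[(pool, frame.group('frame'))] += 1
            top_frame_seen = True
    return counts
```

With a dump saved via "jstack <pid> > dump.txt", calling
count_top_frames(open("dump.txt").read()).most_common(10) would list the
most frequent (pool, top frame) pairs, which makes a burst of Thrift
threads in socketRead0 easy to spot.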

The problem starts simultaneously on all nodes of the cluster.
I cannot tie it to increased load from clients (the read rate doesn't
increase during the problem).
There also seems to be no problem with the disks (I/O latencies are OK).

Could anybody please give some advice on further troubleshooting?

-- 
Best Regards,
Dmitry Simonov
