This sounds like a bad query or a large partition. If a large partition is 
requested on multiple nodes (because of the consistency level), it will pressure 
all of those replica nodes; with RF=3 and QUORUM reads, for example, every read 
of a hot partition lands on two replicas at once. Then, as the cluster tries to 
redistribute the rest of the load, the other nodes can get overwhelmed, too.

Look at nodetool cfstats to see whether you have some large partitions. You may 
also see them as warnings in the system.log when they get compacted.
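For example (a sketch; the exact cfstats labels and the log path vary by 
version and install):

    # largest partition per table, as measured at compaction/flush time
    nodetool cfstats | grep -E 'Table:|Compacted partition maximum bytes'
    # flush/compaction warnings about oversized partitions
    grep -i 'large partition' /var/log/cassandra/system.log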

Also check for any ALLOW FILTERING queries in the code (or slow query stats, if 
you have them).
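A quick way to find them (a sketch; the source path is a placeholder):

    # flag queries that scan and filter server-side instead of
    # seeking by partition key
    grep -rn 'ALLOW FILTERING' /path/to/app/src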

Sean


From: Dmitry Simonov <dimmobor...@gmail.com>
Sent: Thursday, June 27, 2019 5:22 PM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Bursts of Thrift threads make cluster unresponsive

> Is there an order in which the events you described happened, or is the order 
> in which you presented them the order in which you noticed things going wrong?

At first, the Thrift thread count starts increasing.
After 2 or 3 minutes, those threads consume all CPU cores.
After that, simultaneously: message drops occur, read latency increases, and 
active read tasks appear.

On Fri, Jun 28, 2019 at 01:40, Avinash Mandava <avin...@vorstella.com> wrote:
Yeah, I skimmed too fast. Don't add more work if the CPU is pegged, and if 
you're using the Thrift protocol, NTR (Native-Transport-Requests) would not 
have values.

Is there an order in which the events you described happened, or is the order 
in which you presented them the order in which you noticed things going wrong?

On Thu, Jun 27, 2019 at 1:29 PM Dmitry Simonov <dimmobor...@gmail.com> wrote:
Thanks for your reply!

> Have you tried increasing concurrent reads until you see more activity in 
> disk?
When the problem occurs, the 1.2k - 2k freshly created Thrift threads consume 
all CPU on all cores.
Can increasing concurrent reads help in this situation?

> org.apache.cassandra.metrics.type=ThreadPools.path=transport.scope=Native-Transport-Requests.name=TotalBlockedTasks.Count
This metric is 0 on all cluster nodes.

On Fri, Jun 28, 2019 at 00:34, Avinash Mandava <avin...@vorstella.com> wrote:
Have you tried increasing concurrent reads until you see more activity on disk? 
If you've always got 32 active reads and high pending reads, it could just be 
dropping reads because the queues are saturated. That could be an artificial 
bottleneck at the C* process level.
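To check where that cap sits (a sketch; the config path varies by install):

    # cap on the read stage thread pool; 32 is a common default
    grep concurrent_reads /etc/cassandra/cassandra.yaml
    # compare active vs. pending reads to see if the pool is saturated
    nodetool tpstats | grep -iE 'pool name|readstage'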

Also, what does this metric show over time:

org.apache.cassandra.metrics.type=ThreadPools.path=transport.scope=Native-Transport-Requests.name=TotalBlockedTasks.Count
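
One way to sample it over time (a sketch; nodetool tpstats reports this pool on 
newer versions, and it may stay at zero for Thrift-only clients):

    # sample the native transport pool every 10 seconds
    while true; do
      nodetool tpstats | grep -i 'Native-Transport-Requests'
      sleep 10
    done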



On Thu, Jun 27, 2019 at 1:52 AM Dmitry Simonov <dimmobor...@gmail.com> wrote:
Hello!

We have run into the following problem several times.

Our Cassandra cluster (5 nodes) becomes unresponsive for ~30 minutes:
- all CPUs are at 100% load (normally the load average is ~5 on these 16-core 
machines)
- Cassandra's thread count rises from 300 to 1300 - 2000; most of the new 
threads are Thrift threads in java.net.SocketInputStream.socketRead0(Native 
Method), while the count of other threads doesn't increase (a way to count 
them is sketched below)
- some Read messages are dropped
- read latency (p99.9) increases to 20 - 30 seconds
- there are up to 32 active Read Tasks and up to 3k - 6k pending Read Tasks
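
For reference, a quick way to get these counts during an incident (a sketch; 
the "Thrift" thread-name prefix assumes the sync rpc_server_type):

    jstack <cassandra_pid> > threads.txt
    # total Thrift worker threads
    grep -c '"Thrift' threads.txt
    # how many of them are parked in the blocking socket read
    grep -c 'socketRead0' threads.txt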

The problem starts simultaneously on all nodes of the cluster.
I cannot tie it to increased load from clients (the "read rate" doesn't 
increase during the problem).
It also looks like there is no problem with the disks (I/O latencies are OK).

Could anybody please give some advice on further troubleshooting?

--
Best Regards,
Dmitry Simonov


--
www.vorstella.com
408 691 8402


--
Best Regards,
Dmitry Simonov


--
www.vorstella.com
408 691 8402


--
Best Regards,
Dmitry Simonov
