Hi, can you post a Java stack trace?
Agrawal, Pratik wrote on Tue, Dec 11, 2018 at 10:26 PM:
> Hello all,
>
>
>
> I’ve been doing more analysis and I have a few questions:
>
>
>
>1. We observed that most of the requests are blocked on the NTR queue. I
>increased the queue size from 128 (the default) to 1024, and this time the
>system recovered automatically (latencies went back to normal) without
>removing the node from the cluster.
>2. Is there a way to fail NTR requests fast, rather than having them
>block on the NTR queue when the queue is full? (Queue-sizing sketch below.)
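>
> For reference, here is a minimal sketch of how we resized the queue. I'm
> assuming the queue in question is the one bounded by the
> cassandra.max_queued_native_transport_requests system property (which, if
> I recall correctly, defaults to 128), set as a JVM option in
> cassandra-env.sh:
>
>     # cassandra-env.sh -- raise the NTR queue bound from 128 to 1024
>     JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=1024"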
>
>
>
> Thanks,
>
> Pratik
>
> *From: *"Agrawal, Pratik"
> *Date: *Monday, December 3, 2018 at 11:55 PM
> *To: *"user@cassandra.apache.org" , Marc
> Selwan
> *Cc: *Jeff Jirsa , Ben Slater <
> ben.sla...@instaclustr.com>
> *Subject: *Re: Cassandra single unreachable node causing total cluster
> outage
>
>
>
> Hello,
>
>
>
>1. Cassandra latencies spiked to 5-6x normal (both reads and writes),
>reaching high single-digit seconds.
>2. As I said in my previous email, we don’t bound the NTR threads and
>queue. The nodes’ NTR queues started piling up and requests started
>getting blocked. 8 (mainly 4) out of 18 nodes in the cluster had NTR
>requests blocked.
>3. As a result of 1.) and 2.), the Cassandra system resources spiked
>(CPU, IO, system load, # SSTables (10x, 250 -> 2500), memtable switch
>count, pending compactions, etc.).
>4. One interesting thing we observed was that reads at quorum consistency
>were not having any of these issues (high latencies, requests backing up),
>while reads at serial consistency were consistently failing on the client
>side due to C* timeouts. (A sketch of the two read paths is below.)
>5. We used the nodetool removenode command to remove the node from the
>cluster. The node wasn’t reachable (IP down).
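>
> To make point 4 concrete, this is roughly how we issue the two kinds of
> reads. A minimal sketch, assuming the DataStax Java driver 3.x API; the
> contact point, keyspace, and table names are placeholders:
>
>     import com.datastax.driver.core.*;
>
>     public class ReadPaths {
>         public static void main(String[] args) {
>             Cluster cluster = Cluster.builder()
>                     .addContactPoint("10.0.0.1")  // placeholder
>                     .build();
>             Session session = cluster.connect();
>
>             // Quorum read: plain replica read; this path stayed healthy.
>             Statement quorumRead = new SimpleStatement(
>                     "SELECT * FROM ks.tbl WHERE id = 1")
>                     .setConsistencyLevel(ConsistencyLevel.QUORUM);
>             session.execute(quorumRead);
>
>             // Serial read: goes through the Paxos read path; this is the
>             // one that timed out client-side while the dead node was
>             // still in the cluster.
>             Statement serialRead = new SimpleStatement(
>                     "SELECT * FROM ks.tbl WHERE id = 1")
>                     .setConsistencyLevel(ConsistencyLevel.SERIAL);
>             session.execute(serialRead);
>
>             cluster.close();
>         }
>     }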
>
>
>
> One thing we don’t understand is that as soon as we remove the dead node
> from the cluster, the system recovers within a minute or so. My main
> question is: is there a bug in C* where serial-consistency calls block on
> some resource tied to the dead node, which is only released once the node
> is removed from the cluster, or are we hitting some limit here?
>
>
>
> Also, as the cluster size increases, the impact of the dead node on
> serial-consistency reads decreases (latencies spike for a minute or two
> and then the system recovers automatically).
>
>
>
> Any pointers?
>
>
>
> Thanks,
>
> Pratik
>
>
>
> *From: *Marc Selwan
> *Reply-To: *"user@cassandra.apache.org"
> *Date: *Monday, December 3, 2018 at 1:09 AM
> *To: *"user@cassandra.apache.org"
> *Cc: *Jeff Jirsa, Ben Slater <ben.sla...@instaclustr.com>
> *Subject: *Re: Cassandra single unreachable node causing total cluster
> outage
>
>
>
> Ben's question is a good one - What are the exact symptoms you're
> experiencing? Is it latency spikes? Nodes flapping? That'll help us figure
> out where to look.
>
>
>
> When you removed the down node, which command did you use?
>
>
>
> Best,
>
> Marc
>
>
>
> On Sun, Dec 2, 2018 at 1:36 PM Agrawal, Pratik wrote:
>
> One other thing I forgot to add:
>
>
>
> native_transport_max_threads: 128
>
>
>
> We have this setting commented out; should we bound it? I am planning to
> experiment with bounding it.
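>
> For concreteness, bounding it would just mean uncommenting the line in
> cassandra.yaml; my understanding (worth double-checking) is that 128 is
> also the built-in default when the line is commented out:
>
>     # cassandra.yaml
>     native_transport_max_threads: 128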
>
>
>
> Thanks,
> Pratik
>
>
>
> *From: *"Agrawal, Pratik"
> *Date: *Sunday, December 2, 2018 at 4:33 PM
> *To: *"user@cassandra.apache.org" , Jeff Jirsa
> , Ben Slater
>
>
> *Subject: *Re: Cassandra single unreachable node causing total cluster
> outage
>
>
>
> I looked into some of the logs and saw that, at the time of the event,
> Native-Transport-Requests started getting blocked.
>
>
>
> e.g.
>
> [INFO] org.apache.cassandra.utils.StatusLogger:
> Native-Transport-Requests   128   133   5179582116   19114
>
>
>
> The number of blocked requests kept increasing over a period of 5 minutes
> and then plateaued.
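>
> (We tracked this pool with nodetool as well; tpstats reports the same
> counters, with columns Active / Pending / Completed / Blocked / All time
> blocked, if memory serves:)
>
>     nodetool tpstats | grep Native-Transport-Requests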
>
>
>
> As soon as we remove the dead node from the cluster, things recover pretty
> quickly and the cluster becomes stable.
>
>
>
> Any pointers on what to look at to debug why requests are getting blocked
> when a node goes down?
>
>
>
> Also, one other thing to note: we reproduced this scenario in our test
> environment, and as we scale the cluster up, it automatically recovers in
> a matter of minutes without removing the node from the cluster. It seems
> like we are hitting some vertical scalability limit (maybe because of our
> configuration).
>
>
>
>
>
> Thanks,
>
> Pratik
>
> *From: *Jeff Jirsa
> *Reply-To: *"user@cassandra.apache.org"
> *Date: *Tuesday, November 27, 2018 at 9:37 PM
> *To: *"user@cassandra.apache.org"
> *Subject: *Re: Cassandra single unreachable node causing total cluster
> outage
>
>
>
> Could also be the app not detecting the host is down, so it keeps trying
> to use it as a coordinator.
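>
> (If it helps to rule that out, a quick sketch for dumping the Java driver
> 3.x view of host state from the app side:
>
>     // assuming an existing com.datastax.driver.core.Cluster named "cluster"
>     for (Host h : cluster.getMetadata().getAllHosts()) {
>         System.out.println(h.getAddress() + " up=" + h.isUp());
>     }
>
> A host the driver still thinks is up would keep being picked as a
> coordinator.)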
>
>
>
> --
>
> Jeff Jirsa
>
>
>
>
> On Nov 27, 2018, at