Re: Unexplainable spikes of requests latency

2018-12-12 Thread Nitan Kainth
DigestMismatchExceptions --> could be due to data being out of sync. Are you
running repairs?
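
If not, a minimal repair pass to try (the keyspace name below is only a
placeholder) would be something like the following, run on each node in turn:

    nodetool repair -pr my_keyspace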

On Wed, Dec 12, 2018 at 11:39 AM Виталий Савкин wrote:

> Hi everyone!
>
> A few times a day I see spikes in request latencies on my Cassandra
> clients. Usually the 99thPercentile is below 100ms, but at those times it
> grows above 1 second.
> The type of request doesn't matter: different services are affected, and I
> found that three absolutely identical requests (to the same partition key,
> issued within a three-second interval) completed in 1ms, 30ms and 1100ms. I
> also found no correlation between the spikes and load patterns. G1 GC does
> not report any significant (>50ms) delays.
> A few suspicious things:
>
>- nodetool shows that there are dropped READs
>- there are DigestMismatchExceptions in logs
>- in tracing events I see that event "Executing single-partition query
>on *" sometimes happens right after "READ message received from /*.*.*.*"
>(in less than 100 micros) and sometimes after hundreds of milliseconds
>
> My cluster runs on six c5.2xlarge Amazon instances, data is stored on EBS.
> Cassandra version is 3.10.
> Any help in explaining this behavior is appreciated. I'm glad to share
> more details if needed.
>
> Thanks,
> Vitaliy Savkin.
>


Cassandra lucene secondary indexes

2018-12-12 Thread Brian Spindler
Hi all, we recently started using the cassandra-lucene secondary index
support, which Instaclustr has since assumed ownership of. Thank you for that, btw!

We are experiencing a strange issue where adding/removing nodes fails: the
joining node is left hanging on a "Secondary index build" compaction that
never completes.

We're running v3.11.3 of Cassandra and of the plugin. Has anyone experienced
this before?

It's a relatively small cluster (~6 nodes) in our user acceptance environment,
so there isn't a lot of load either.
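
In case it helps narrow things down, these are the standard commands we can
use to watch the join (a rough sketch):

    nodetool compactionstats   # shows the "Secondary index build" task and its progress
    nodetool netstats          # shows whether streaming for the join is still active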

Thanks!

-- 
-Brian


Unexplainable spikes of requests latency

2018-12-12 Thread Виталий Савкин
Hi everyone!

A few times a day I see spikes in request latencies on my Cassandra clients.
Usually the 99thPercentile is below 100ms, but at those times it grows above 1
second.
The type of request doesn't matter: different services are affected, and I
found that three absolutely identical requests (to the same partition key,
issued within a three-second interval) completed in 1ms, 30ms and 1100ms. I
also found no correlation between the spikes and load patterns. G1 GC does not
report any significant (>50ms) delays.
A few suspicious things:

   - nodetool shows that there are dropped READs (see the commands sketched
   after this list)
   - there are DigestMismatchExceptions in logs
   - in tracing events I see that the event "Executing single-partition query
   on *" sometimes happens right after "READ message received from /*.*.*.*"
   (in less than 100 microseconds) and sometimes after hundreds of milliseconds
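
For reference, these can be checked with something like the following (the
keyspace, table and key are placeholders):

    # per-node thread pool stats, including dropped READ message counters
    nodetool tpstats

    -- in cqlsh, trace one of the slow requests end to end
    cqlsh> TRACING ON;
    cqlsh> SELECT * FROM my_keyspace.my_table WHERE pk = 'example-key';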

My cluster runs on six c5.2xlarge Amazon instances, data is stored on EBS.
Cassandra version is 3.10.
Any help in explaining this behavior is appreciated. I'm glad to share more
details if needed.

Thanks,
Vitaliy Savkin.


Re: Cassandra single unreachable node causing total cluster outage

2018-12-12 Thread cclive1601你
Hi, can you post some Java stack traces?
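
A minimal way to grab those, assuming shell access to the node and that the
pgrep pattern below matches your Cassandra process, would be:

    jstack $(pgrep -f CassandraDaemon) > cassandra-threads.txt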

Agrawal, Pratik wrote on Tue, Dec 11, 2018 at 10:26 PM:

> Hello all,
>
>
>
> I’ve been doing more analysis and I’ve few questions:
>
>
>
>    1. We observed that most of the requests are blocked on the
>    Native-Transport-Requests (NTR) queue. I increased the queue size from
>    128 (default) to 1024, and this time the system does recover
>    automatically (latencies go back to normal) without removing the node
>    from the cluster (see the sketch after this list).
>    2. Is there a way to fail NTR requests fast rather than having them
>    blocked on the NTR queue when the queue is full?
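>
> For reference, assuming the queue here is the one governed by the
> cassandra.max_queued_native_transport_requests system property (which also
> defaults to 128; please verify for your version), the change sketches out
> to a single line in cassandra-env.sh:
>
>     JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=1024"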
>
>
>
> Thanks,
>
> Pratik
>
> *From: *"Agrawal, Pratik" 
> *Date: *Monday, December 3, 2018 at 11:55 PM
> *To: *"user@cassandra.apache.org" , Marc
> Selwan 
> *Cc: *Jeff Jirsa , Ben Slater <
> ben.sla...@instaclustr.com>
> *Subject: *Re: Cassandra single unreachable node causing total cluster
> outage
>
>
>
> Hello,
>
>
>
>    1. Cassandra latencies spiked to 5-6 times normal (both read and
>    write). The latencies were in the high single-digit seconds.
>    2. As I said in my previous email, we don't bound the NTR threads and
>    queue; the nodes' NTR queues started piling up and requests started
>    getting blocked. 8 (mainly 4) out of 18 nodes in the cluster had NTR
>    requests blocked.
>    3. As a result of 1) and 2), Cassandra system resource usage spiked
>    (CPU, IO, system load, number of SSTables (10x, 250 -> 2500), memtable
>    switch count, pending compactions, etc.).
>    4. One interesting thing we observed was that the read calls with quorum
>    consistency were not having any issues (high latencies and requests
>    backing up) while the read calls with serial consistency were
>    consistently failing on the client side due to C* timeouts (see the
>    cqlsh sketch after this list).
>    5. We used the nodetool removenode command to remove the node from the
>    cluster. The node wasn't reachable (IP down).
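>
> A concrete illustration of 4. from cqlsh (keyspace, table and key are
> placeholders):
>
>     cqlsh> CONSISTENCY QUORUM;
>     cqlsh> SELECT * FROM ks.tbl WHERE pk = 1;  -- fine during the outage
>     cqlsh> CONSISTENCY SERIAL;
>     cqlsh> SELECT * FROM ks.tbl WHERE pk = 1;  -- times out until the dead node is removed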
>
>
>
> One thing we don't understand is that as soon as we remove the dead node
> from the cluster, the system recovers within a minute or so. My main
> question is: is there a bug in C* where serial-consistency calls get blocked
> on some dead-node resource, with that resource only released once the dead
> node is removed from the cluster, OR are we hitting some limit here?
>
>
>
> Also, as the cluster size increases, the impact of the dead node on
> serial-consistency reads decreases (as in, the latencies spike up for a
> minute or two and the system automatically recovers).
>
>
>
> Any pointers?
>
>
>
> Thanks,
>
> Pratik
>
>
>
> *From: *Marc Selwan 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Monday, December 3, 2018 at 1:09 AM
> *To: *"user@cassandra.apache.org" 
> *Cc: *Jeff Jirsa , Ben Slater <
> ben.sla...@instaclustr.com>
> *Subject: *Re: Cassandra single unreachable node causing total cluster
> outage
>
>
>
> Ben's question is a good one - What are the exact symptoms you're
> experiencing? Is it latency spikes? Nodes flapping? That'll help us figure
> out where to look.
>
>
>
> When you removed the down node, which command did you use?
>
>
>
> Best,
>
> Marc
>
>
>
> On Sun, Dec 2, 2018 at 1:36 PM Agrawal, Pratik wrote:
>
> One other thing I forgot to add:
>
>
>
> native_transport_max_threads: 128
>
>
>
> We have commented this setting out; should we bound it? I am planning to
> experiment with bounding this setting.
>
>
>
> Thanks,
> Pratik
>
>
>
> *From: *"Agrawal, Pratik" 
> *Date: *Sunday, December 2, 2018 at 4:33 PM
> *To: *"user@cassandra.apache.org" , Jeff Jirsa
> , Ben Slater 
>
>
> *Subject: *Re: Cassandra single unreachable node causing total cluster
> outage
>
>
>
> I looked into some of the logs and I saw that at the time of the event the
> Native requests started getting blocked.
>
>
>
> e.g.
>
>  [INFO] org.apache.cassandra.utils.StatusLogger:
> Native-Transport-Requests   128   133   5179582116   19114
>
>
>
> The number of blocked requests kept increasing over a period of 5 minutes
> and then stayed constant.
>
>
>
> As soon as we remove the dead node from the cluster, things recover pretty
> quickly and the cluster becomes stable.
>
>
>
> Any pointers on what to look for when debugging why requests are getting
> blocked when a node goes down?
>
>
>
> Also, one other thing to note: we reproduced this scenario in our test
> environment, and as we scale up the cluster, it automatically recovers in a
> matter of minutes without removing the node from the cluster. It seems like
> we are reaching some vertical scalability limit (maybe because of our
> configuration).
>
>
>
>
>
> Thanks,
>
> Pratik
>
> *From: *Jeff Jirsa 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Tuesday, November 27, 2018 at 9:37 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: Cassandra single unreachable node causing total cluster
> outage
>
>
>
> Could also be the app not detecting that the host is down and continuing to
> try to use it as a coordinator.
>
>
>
> --
>
> Jeff Jirsa
>
>
>
>
> On Nov 27, 2018, at