Timestamp of Last Repair

2018-12-11 Thread Fred Habash
We are trying to detect a scenario where some of our smaller clusters go
un-repaired for extended periods of time, mostly due to defects in
deployment pipelines or human error.

We would like to automate a check that flags clusters where nodes have gone
un-repaired for more than 7 days, and shoots out an exception alert.

The 'Repaired at' field emits a long integer. I'm not sure if this can be
converted to a timestamp. If not, is there an internal dictionary table in
C* that captures repair history? If not, again, can this be done at all?
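
For context, the check we have in mind would look roughly like the sketch
below, assuming the 'Repaired at' value is Unix epoch milliseconds (as printed
by sstablemetadata); the data path, keyspace/table names, and 7-day threshold
are only placeholders:

    # Rough sketch, not a finished check. Assumes 'Repaired at' is epoch
    # milliseconds; a value of 0 means the sstable was never repaired.
    repaired_at_ms=$(sstablemetadata /var/lib/cassandra/data/my_ks/my_table-*/*Data.db \
        | awk '/Repaired at/ {print $3; exit}')
    repaired_at_ms=${repaired_at_ms:-0}
    now_ms=$(( $(date +%s) * 1000 ))
    age_days=$(( (now_ms - repaired_at_ms) / 86400000 ))
    if [ "$age_days" -gt 7 ]; then
        echo "ALERT: last repair was ${age_days} days ago"
    fi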



Thank you


Blocked NTR request

2018-12-11 Thread Guan Sun
Hi,

I'm using Cassandra 2.2.8 with the default NTR queue configuration
(max_queued_native_transport_requests = 128, native_transport_max_threads =
128), and from the metrics I'm seeing that some native transport requests are
being blocked.
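
For reference, this is roughly how I'm watching the counter (a sketch only;
the exact column layout of tpstats differs a bit between versions, but the
last column is the cumulative "All time blocked" count):

    # Check the Native-Transport-Requests pool, or poll it periodically.
    nodetool tpstats | grep -i "Native-Transport-Requests"
    watch -n 10 'nodetool tpstats | grep -i "Native-Transport-Requests"'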

I'm trying to understand what happens to the blocked native transport
requests. Will they be rejected immediately, so the client gets
exceptions/errors from Cassandra? Or will they wait somewhere outside of the
queue and be added to the NTR queue later? Or will they simply time out after
some time?

Any help would be appreciated!

Thanks,
Guan


Re: Cassandra single unreachable node causing total cluster outage

2018-12-11 Thread Agrawal, Pratik
Hello all,

I’ve been doing more analysis and I have a few questions:


  1.  We observed that most of the requests are blocked on the NTR queue. I 
increased the queue size from 128 (default) to 1024 (one way to set this is 
sketched after this list), and this time the system does recover automatically 
(latencies go back to normal) without removing the node from the cluster.
  2.  Is there a way to fail NTR requests fast, rather than having them block 
on the NTR queue when the queue is full?
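
For reference, a rough sketch of how the bound can be raised (an assumption to
verify for your version: on some Cassandra releases the NTR queue bound is a
JVM system property set in cassandra-env.sh, applied on restart, rather than a
cassandra.yaml option):

    # Sketch only; confirm the right knob for your Cassandra version.
    JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=1024"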

Thanks,
Pratik
From: "Agrawal, Pratik" 
Date: Monday, December 3, 2018 at 11:55 PM
To: "user@cassandra.apache.org" , Marc Selwan 

Cc: Jeff Jirsa , Ben Slater 
Subject: Re: Cassandra single unreachable node causing total cluster outage

Hello,


  1.  Cassandra latencies spiked to 5-6 times normal (both read and write). 
The latencies were in the high single-digit seconds.
  2.  As I said in my previous email, we don’t bound the NTR threads and queue; 
the Cassandra nodes' NTR queues started piling up and requests started getting 
blocked. 8 (mainly 4) out of 18 nodes in the cluster had NTR requests blocked.
  3.  As a result of 1) and 2), Cassandra system resource usage spiked (CPU, 
IO, system load, # SSTables (10x, 250 -> 2500), memtable switch count, pending 
compactions, etc.).
  4.  One interesting thing we observed was that reads at quorum consistency 
were not showing these issues (high latencies and requests backing up), while 
reads at serial consistency were consistently failing on the client side due 
to C* timeouts.
  5.  We used the nodetool removenode command to remove the node from the 
cluster. The node wasn’t reachable (IP down).

One thing we don’t understand is that as soon as we remove the dead node from 
the cluster, the system recovers within a minute or so. My main question is: 
is there a bug in C* where serial-consistency calls get blocked on some 
dead-node resource, with the resource getting released as soon as the dead 
node is removed from the cluster, OR are we hitting some limit here?

Also, as the cluster size increases, the impact of the dead node on 
serial-consistency reads decreases (the latencies spike up for a minute or two 
and then the system automatically recovers).

Any pointers?

Thanks,
Pratik

From: Marc Selwan 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, December 3, 2018 at 1:09 AM
To: "user@cassandra.apache.org" 
Cc: Jeff Jirsa , Ben Slater 
Subject: Re: Cassandra single unreachable node causing total cluster outage

Ben's question is a good one - What are the exact symptoms you're experiencing? 
Is it latency spikes? Nodes flapping? That'll help us figure out where to look.

When you removed the down node, which command did you use?

Best,
Marc

On Sun, Dec 2, 2018 at 1:36 PM Agrawal, Pratik  
wrote:
One other thing I forgot to add:

native_transport_max_threads: 128

We have this setting commented out; should we bound it? I am planning to 
experiment with bounding it.
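
A rough sketch of the experiment (the cassandra.yaml path is an assumption for
our environment, and 128 is just the default mentioned above, not a
recommendation):

    # Uncomment/bound the NTR thread pool setting, then restart the node.
    sed -i 's/^# *native_transport_max_threads:.*/native_transport_max_threads: 128/' \
        /etc/cassandra/cassandra.yaml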

Thanks,
Pratik

From: "Agrawal, Pratik" mailto:paagr...@amazon.com>>
Date: Sunday, December 2, 2018 at 4:33 PM
To: "user@cassandra.apache.org" 
mailto:user@cassandra.apache.org>>, Jeff Jirsa 
mailto:jji...@gmail.com>>, Ben Slater 
mailto:ben.sla...@instaclustr.com>>

Subject: Re: Cassandra single unreachable node causing total cluster outage

I looked into some of the logs and saw that at the time of the event the 
native transport requests started getting blocked.

e.g.

 [INFO] org.apache.cassandra.utils.StatusLogger: Native-Transport-Requests  
 128   133   5179582116 19114

The number of blocked requests kept increasing over a period of 5 minutes and 
then held constant.

As soon as we remove the dead node from the cluster, things recover pretty 
quickly and the cluster becomes stable.

Any pointers on what to look at to debug why requests are getting blocked 
when a node goes down?

Also, one other thing to note: we reproduced this scenario in our test 
environment, and as we scale up the cluster it automatically recovers in a 
matter of minutes without removing the node from the cluster. It seems like we 
are reaching some vertical scalability limit (maybe because of our 
configuration).


Thanks,
Pratik
From: Jeff Jirsa <jji...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, November 27, 2018 at 9:37 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Cassandra single unreachable node causing total cluster outage

Could also be the app not detecting the host is down and it keeps trying to use 
it as a coordinator

--
Jeff Jirsa


On Nov 27, 2018, at 6:33 PM, Ben Slater <ben.sla...@instaclustr.com> wrote:
In what way does the cluster become unstable (i.e., more specifically, what 
are the symptoms)? My first thought would be the loss of the node causing the 
other nodes to 

Re: 1.2.19: AssertionError when running compactions on a CF with TTLed columns

2018-12-11 Thread Reynald Borer
Hi everyone,

I was finally able to sort out my problem in an "interesting" manner that I
think is worth sharing on the list!

What I did was the following: on each node, I stopped Cassandra, completely
dropped the data files of the column family, started Cassandra again, and
issued a repair for this column family.
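
Roughly, the per-node steps looked like this (a sketch only; keyspace/table
names and paths are placeholders for my actual setup, and it relies on the
data still being present on the other replicas, RF=3 here, so go one node at
a time):

    # Stop the node, drop the CF's data files, restart, then repair that CF.
    service cassandra stop
    rm -rf /var/lib/cassandra/data/my_keyspace/my_cf/
    service cassandra start
    nodetool repair my_keyspace my_cf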

The process took time since the cluster is formed of 40 nodes, but once
done, the nodes didn't exhibit this assertion error anymore!

I believe this was triggered because of me tweaking the
"sstable_size_in_mb" parameter. Somehow I had data files with different
sizes and it confused Cassandra.

So, problem solved now :-)

Cheers,
Reynald


On Fri, Aug 31, 2018 at 7:45 AM Reynald Borer 
wrote:

> Hi everyone,
>
> I'm running a Cassandra 1.2.19 cluster of 40 nodes and compactions of a
> specific column family are sporadically raising an AssertionError like this
> (full stack trace visible under
> https://gist.github.com/rborer/46862d6d693c0163aa8fe0e74caa2d9a):
>
> ERROR [CompactionExecutor:9137] 2018-08-27 11:43:05,197
> org.apache.cassandra.service.CassandraDaemon - Exception in thread
> Thread[CompactionExecutor:9137,1,main]
> java.lang.AssertionError: 2
> at
> org.apache.cassandra.db.compaction.LeveledManifest.replace(LeveledManifest.java:267)
>
> The data written in this column family can be seen as wide rows, that is,
> rows with lots of columns. Each column has a TTL of 7 days though.
>
> Whenever this happens, it seems to block compactions of this column family
> (I see the pending compactions increasing) until I restart the failing node.
>
> I have searched on Jira and on this mailing list about this issue without
> too much luck. I suspect it may be related to
> https://issues.apache.org/jira/browse/CASSANDRA-6563 although it's hard
> for me to confirm.
>
> I know this version is pretty old; does this issue ring a bell for any of
> you?
>
> Here are some more details about my cluster:
>
> - it is composed of 40 nodes
> - it is pretty old and I'm in the process of upgrading it; it was previously
> running without issues under versions 1.0.12 & 1.1.12
> - it really affects a single column family only (schema can be seen at
> https://gist.github.com/rborer/46862d6d693c0163aa8fe0e74caa2d9a#file-schema-txt
> )
> - my cluster is set up with RandomPartitioner (inherited from when it was
> set up on version 0.7) and a replication factor of 3
> - it's running weekly repairs (and this assertion happens mostly during
> repairs)
> - what I also noted is that since the cluster was upgraded to 1.2.19 the
> disk size of this column family keeps increasing (it went from 400G to
> 1.2T!)
>
> Thanks in advance for your help.
>
> Best regards,
> Reynald
>
>
>
>