Re: nodetool removenode causing the schema to go out of sync

2017-06-28 Thread Jeff Jirsa
On 2017-06-28 18:51 (-0700), Jai Bheemsen Rao Dhanwada wrote: > Hello, we are using C* version 2.1.6 and lately we are seeing an issue where nodetool removenode causes the schema to go out of sync and causes clients to fail for 2-3 minutes. > C* cluster is

nodetool removenode causing the schema to go out of sync

2017-06-28 Thread Jai Bheemsen Rao Dhanwada
Hello, we are using C* version 2.1.6 and lately we are seeing an issue where nodetool removenode causes the schema to go out of sync and causes clients to fail for 2-3 minutes. The C* cluster spans 8 datacenters with RF=3 and has 50 nodes. We have 130 keyspaces and 500 CFs in the cluster. Here are
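A quick way to confirm a schema disagreement like this is nodetool describecluster, which lists the schema version each node reports (abbreviated, illustrative output; the UUIDs and addresses are placeholders):

  $ nodetool describecluster
  Cluster Information:
          ...
          Schema versions:
                  86afa796-...: [10.20.30.1, 10.20.30.2]
                  c2a2bb4f-...: [10.20.30.3]

More than one schema version listed means the nodes have not yet converged on the same schema.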

Re: nodetool repair failure

2017-06-28 Thread Akhil Mehra
nodetool repair has a trace option: nodetool repair -tr yourkeyspacename. See if that provides you with additional information. Regards, Akhil > On 28/06/2017, at 2:25 AM, Balaji Venkatesan wrote: > We use Apache Cassandra 3.10-13 > On Jun 26, 2017
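When the repair is run with -tr/--trace, the trace data should also be queryable afterwards from the system_traces tables (a sketch; the session_id placeholder must be copied from the sessions table):

  -- most recent trace sessions
  SELECT session_id, started_at, parameters FROM system_traces.sessions LIMIT 10;

  -- detailed events for one session
  SELECT activity, source, source_elapsed FROM system_traces.events
  WHERE session_id = <session_id>;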

Re: How do you monitor your Cassandra cluster?

2017-06-28 Thread Petrus Gomes
I'm using JMX+Prometheus and Grafana. JMX = https://github.com/prometheus/jmx_exporter Prometheus + Grafana = https://prometheus.io/docs/visualization/grafana/ There are some dashboard examples, like this one: https://grafana.com/dashboards/371 Looks good. Thanks, Petrus Silva On Wed, Jun 28, 2017
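For anyone setting this up, the exporter is attached to Cassandra as a javaagent and reads a small YAML config (a minimal sketch; the jar path, port and file locations are placeholders, and the catch-all rule exposes every MBean, which you can trim later):

  # cassandra-env.sh
  JVM_OPTS="$JVM_OPTS -javaagent:/opt/jmx_prometheus_javaagent.jar=7070:/etc/cassandra/jmx_exporter.yaml"

  # /etc/cassandra/jmx_exporter.yaml
  lowercaseOutputName: true
  lowercaseOutputLabelNames: true
  rules:
    - pattern: ".*"

Prometheus then scrapes port 7070 on each node, and Grafana reads from Prometheus.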

Re: ALL range query monitors failing frequently

2017-06-28 Thread kurt greaves
I'd say that no, a range query probably isn't the best for monitoring, but it really depends on how important it is that the range you select is consistent. From those traces it does seem that the bulk of the time was spent waiting for responses from the replicas, which may indicate a network
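For what it's worth, the per-replica wait is visible from the client by requesting a trace along with the query (a sketch using the DataStax Python driver; contact point, keyspace and table are placeholders):

  from cassandra.cluster import Cluster
  from cassandra import ConsistencyLevel
  from cassandra.query import SimpleStatement

  cluster = Cluster(['10.0.0.1'])          # placeholder contact point
  session = cluster.connect('mykeyspace')  # placeholder keyspace

  stmt = SimpleStatement("SELECT * FROM mytable LIMIT 100",
                         consistency_level=ConsistencyLevel.ALL)
  result = session.execute(stmt, trace=True)

  trace = result.get_query_trace()
  for event in trace.events:
      # source = node that logged the event, source_elapsed = time into the request
      print(event.source, event.source_elapsed, event.description)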

Re: Restore Snapshot

2017-06-28 Thread kurt greaves
Hm, I did recall seeing a ticket for this particular use case, which is certainly useful; I just didn't think it had been implemented yet. Turns out it's been in since 2.0.7, so you should be receiving writes with join_ring=false. If you confirm you aren't receiving writes, then we have an issue.
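For reference, join_ring is a JVM system property passed at startup, and the node can be made to join the ring later with nodetool (a sketch; paths are illustrative):

  # either on the command line:
  bin/cassandra -Dcassandra.join_ring=false

  # ...or in cassandra-env.sh before starting the service:
  JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"

  # once the node has caught up, have it join the ring:
  nodetool join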

Re: [IMPORTANT UPDATE]: PLEASE DO NOT UPDATE SCHEMA

2017-06-28 Thread Jeff Jirsa
If you suspect this is different than #13004, please don't keep it a secret - even if you haven't fixed it, if you can describe the steps to repro, that'd be incredibly helpful. - Jeff On 2017-06-27 13:42 (-0700), Michael Shuler wrote: > To clarify, are you talking

Re: Restore Snapshot

2017-06-28 Thread Anuj Wadehra
Thanks Kurt. I think the main scenario that MUST be addressed by snapshots is backup/restore, so that a node can be restored in minimal time and the lengthy procedure of bootstrapping with join_ring=false followed by a full repair can be avoided. The plain restore snapshot + repair scenario
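For context, the plain snapshot restore + repair flow being compared here is roughly (a sketch; keyspace, table, ids and paths are placeholders):

  # 1. stop the node and clear the live sstables for the table being restored
  # 2. copy the snapshot sstables back into the table's data directory:
  cp /var/lib/cassandra/data/ks1/t1-<id>/snapshots/<snapshot_name>/* \
     /var/lib/cassandra/data/ks1/t1-<id>/
  # 3. restart the node, or load the files without a restart:
  nodetool refresh ks1 t1
  # 4. repair to catch up on writes taken since the snapshot:
  nodetool repair ks1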

Re: ALL range query monitors failing frequently

2017-06-28 Thread Matthew O'Riordan
Hi Kurt, Thanks for the response. A few comments inline: On Wed, Jun 28, 2017 at 1:17 PM, kurt greaves wrote: > You're correct in that the timeout is only driver side. The server will have its own timeouts configured in the cassandra.yaml file. Yup, OK. I suspect

How do you monitor your Cassandra cluster?

2017-06-28 Thread Peng Xiao
Dear All, we are currently using Cassandra 2.1.13, and it has grown to 5 TB in size with 32 nodes in one DC. For monitoring, OpsCenter does not send alarms and is not free in higher versions, so we have to use a simple JMX+Zabbix template. And we plan to use Jolokia+JMX2Graphite to draw the metrics chart
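For reference, Jolokia is also attached as a javaagent and then exposes the same JMX metrics over HTTP (a sketch; the jar path, port and the example metric are illustrative):

  # cassandra-env.sh
  JVM_OPTS="$JVM_OPTS -javaagent:/opt/jolokia-jvm-agent.jar=port=8778,host=0.0.0.0"

  # read a metric over HTTP, e.g. client read latency:
  curl http://localhost:8778/jolokia/read/org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency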

Re: Restore Snapshot

2017-06-28 Thread kurt greaves
There are many scenarios where it can be useful, but to address what seems to be your main concern: you could simply restore and then only read at ALL until your repair completes. If you use snapshot restore with commitlog archiving you're in a better state, but granted, the case you described can
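The commitlog archiving mentioned here is configured in conf/commitlog_archiving.properties (a sketch; the commands, paths and timestamp are illustrative):

  archive_command=/bin/cp %path /backup/commitlog_archive/%name
  restore_command=/bin/cp -f %from %to
  restore_directories=/backup/commitlog_archive
  restore_point_in_time=2017:06:28 12:00:00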

Re: ALL range query monitors failing frequently

2017-06-28 Thread kurt greaves
You're correct in that the timeout is only driver side. The server will have its own timeouts configured in the cassandra.yaml file. I suspect either that you have a node down in your cluster (or 4), or your queries are gradually getting slower. This kind of aligns with the slow query statements
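For reference, the server-side knobs in cassandra.yaml are the per-operation request timeouts; the values below are the 2.1-era defaults, in milliseconds:

  read_request_timeout_in_ms: 5000     # single-partition reads
  range_request_timeout_in_ms: 10000   # range scans (what an ALL range check hits)
  write_request_timeout_in_ms: 2000    # writes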

ALL range query monitors failing frequently

2017-06-28 Thread Matthew O'Riordan
We have a monitoring service that runs on all of our Cassandra nodes and performs different query types to ensure the cluster is healthy. We use different consistency levels for the queries and alert if any of them fail. All of our query types consistently succeed apart from our ALL range
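For illustration, such a check boils down to a range scan issued at CL=ALL, which only succeeds if every replica answers in time (a sketch using the DataStax Python driver; contact point, keyspace and table are placeholders):

  from cassandra.cluster import Cluster
  from cassandra import ConsistencyLevel
  from cassandra.query import SimpleStatement

  cluster = Cluster(['10.0.0.1'])              # placeholder contact point
  session = cluster.connect('monitoring_ks')   # placeholder keyspace

  # a range query (no partition key restriction) at CL=ALL
  check = SimpleStatement("SELECT * FROM health_check LIMIT 10",
                          consistency_level=ConsistencyLevel.ALL)
  try:
      session.execute(check, timeout=5)  # client-side timeout, in seconds
      print("ALL range check OK")
  except Exception as exc:
      print("ALL range check FAILED:", exc)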