Running Node Repair After Changing RF or Replication Strategy for a Keyspace

2019-06-28 Thread Fd Habash
Hi all …

The datastax & apache docs are clear: run ‘nodetool repair’ after you alter a 
keyspace to change its RF or RS.

However, the details are all over the place as to what type of repair to run and 
on which nodes it needs to run. Neither of the above documentation sources is 
clear, and what you find on the internet is quite contradictory.

For example, this IBM doc suggests running both the ‘alter keyspace’ and the repair 
on EACH affected node, or on ‘each node you need to change the RF on’.  Others 
suggest running ‘repair -pr’. 

On a cluster of 1 DC and three racks, this is how I understand it ….
1. Run the ‘alter keyspace’ on a SINGLE node. 
2. As for repairing the altered keyspace, I assume there are two options …
a. Run ‘repair -full [key_space]’ on all nodes in all racks
b. Run ‘repair -pr -full [keyspace]’ on all nodes in all racks

Sounds correct? 
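
For concreteness, this is roughly what I have in mind, assuming a keyspace named 
my_ks in a datacenter named dc1 (both placeholder names):

-- from cqlsh, run once against any single node
ALTER KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

# then, on every node in the cluster, one node at a time
nodetool repair -full my_ks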


Thank you



Is There a Way To Proactively Monitor Reads Returning No Data Due to Consistency Level?

2019-05-07 Thread Fd Habash
Typically, when a read is submitted to C*, it may complete  with  …
1. No errors & returns expected data
2. Errors out with UnavailableException
3. No error & returns zero rows on the first attempt, but the row is returned on 
subsequent runs.

The third scenario happens as a result of cluster entropy, especially during 
unexpected outages affecting on-premises or cloud infrastructure.

Typical scenario …
a) Multiple nodes fail in the cluster
b) Node replaced via bootstrapping
c) Row is in Cassandra, but the client hits nodes that do not have the data yet. 
Gets zero rows. The row is retrieved on the third or fourth attempt and read repair 
takes care of it.
d) Eventually, repair is run and the issue is fixed.

Digging into Cassandra metrics, I’ve found ‘cassandra.unavailables.count’. It looks 
like this metric captures scenario 2 (UnavailableException), however.

I have also read the Yelp article describing a metric they called 
‘underreplicated keyspaces’. These are keyspace ranges that will fail to 
satisfy reads/writes at a certain CL due to insufficient endpoints. If my 
understanding is correct, this is also measuring scenario 2. 

Trying to find a metric to capture scenario 3 above. Is this possible at all?
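
One crude approach I can think of (just a sketch, not a proven solution): write a 
known sentinel row from the application side and periodically read it back at the 
same CL the application uses, alerting when zero rows come back. The host and the 
my_ks.canary table below are placeholders, and this assumes cqlsh accepts the 
CONSISTENCY directive via -e:

#!/bin/sh
# hypothetical canary: my_ks.canary holds a sentinel row with id = 1
OUT=$(cqlsh 10.0.0.10 -e "CONSISTENCY LOCAL_QUORUM; SELECT id FROM my_ks.canary WHERE id = 1;")
if echo "$OUT" | grep -q '(0 rows)'; then
  # zero rows at LOCAL_QUORUM for a row known to exist -> scenario 3
  echo "canary read returned no data" | logger -t cassandra-canary
fi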




Thank you



CL=LQ, RF=3: Can a Write be Lost If Two Nodes ACK'ing it Die

2019-05-02 Thread Fd Habash
C*: 2.2.8
Write CL = LQ
Kspace RF = 3
Three racks

A write gets received by node 1 in rack 1 at above specs. Node 1 (rack1) & node 
2 (rack2)  acknowledge it to the client. 

Within some unit of time, nodes 1 & 2 die. Either ….
- Scenario 1: C* process death: the row did not make it to an sstable (it is in the 
commit log & was in the memtable)
- Scenario 2: Node death: the row may have made it to an sstable, but the nodes are 
gone (will have to bootstrap to replace).

Scenario 1: Row is not lost because once C* is restarted, commit log should 
replay the mutation.

Scenario 2: row is gone forever? If these two nodes are replaced via 
bootstrapping, will they ever get the row back from node 3 (rack3) if the write 
ever made it there?
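
For what it’s worth, one way to check where a given row actually lives and whether 
every replica has it (a sketch; keyspace, table and key are placeholders, and it 
assumes cqlsh accepts the CONSISTENCY directive via -e):

# which nodes are the replicas for this partition key?
nodetool getendpoints my_ks my_table some_key

# a CL=ALL read only succeeds if every replica responds with the row
cqlsh -e "CONSISTENCY ALL; SELECT * FROM my_ks.my_table WHERE pk = 'some_key';"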



Thank you



RE: Bootstrapping to Replace a Dead Node vs. Adding a New Node: Consistency Guarantees

2019-05-01 Thread Fd Habash
Appreciate your response. 

As for extending the cluster & keeping the default range movement = true, C* 
won’t allow me to bootstrap multiple nodes, anyway. 

But the question I’m still posing and have not gotten an answer for is: if the fix 
in CASSANDRA-2434 disallows bootstrapping multiple nodes to extend the cluster 
(which I was able to test in my lab cluster), why did it allow bootstrapping 
multiple nodes in the process of replacing dead nodes (no range calculation)?

This fix forces a node to bootstrap from the former owner. Is this still the case 
when bootstrapping to replace a dead node?



Thank you

From: ZAIDI, ASAD A
Sent: Wednesday, May 1, 2019 5:13 PM
To: user@cassandra.apache.org
Subject: RE: Bootstrapping to Replace a Dead Node vs. Adding a 
New Node: Consistency Guarantees


The article you mentioned here clearly says  “For new users to Cassandra, the 
safest way to add multiple nodes into a cluster is to add them one at a time. 
Stay tuned as I will be following up with another post on bootstrapping.” 

When extending a cluster it is indeed recommended to go slowly & serially. 
Optionally you can use cassandra.consistent.rangemovement=false, but you can run 
into over-streamed data. Since you’re using a release much newer than the one the 
fix was introduced in, I assumed you won’t see the same behavior described for the 
version the fix addresses. After adding a node, if you don’t get consistent data, 
your query consistency level should still be able to pull consistent data, given 
you can tolerate a bit of latency until your repair is complete – if you go by the 
recommendation, i.e. to add one node at a time, you’ll avoid all these nuances.
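
For reference, the flag is a JVM system property passed at startup, e.g. via 
cassandra-env.sh on the joining node (a sketch; use with care since it relaxes the 
consistent-bootstrap guarantee):

# cassandra-env.sh on the node being bootstrapped
JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"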



From: Fd Habash [mailto:fmhab...@gmail.com] 
Sent: Wednesday, May 01, 2019 3:12 PM
To: user@cassandra.apache.org
Subject: RE: Bootstrapping to Replace a Dead Node vs. Adding a New 
Node: Consistency Guarantees

Probably, I needed to be clearer in my inquiry ….

I’m investigating a situation where our diagnostic data is telling us that C* 
has lost some of the application data. I mean, getsstables for the data returns 
zero on all nodes in all racks. 

The Last Pickle article below & Jeff Jirsa had described a situation where 
bootstrapping a node to extend the cluster can lose data if this new node 
bootstraps from a stale SECONDARY replica (a node that was offline > hinted 
hand-off window). This was fixed in CASSANDRA-2434. 
http://thelastpickle.com/blog/2017/05/23/auto-bootstrapping-part1.html

The article & the Jira above describe bootstrapping when extending a cluster.

I understand replacing a dead node does not involve range movement, but will 
the above Jira fix prevent the bootstrap process, when replacing a dead node, 
from using a secondary replica?

Thanks 


Thank you

From: Fred Habash
Sent: Wednesday, May 1, 2019 6:50 AM
To: user@cassandra.apache.org
Subject: Re: Bootstrapping to Replace a Dead Node vs. Adding a New 
Node: Consistency Guarantees

Thank you. 

Range movement is one reason this is enforced when adding a new node. But what 
about forcing a consistent bootstrap, i.e. bootstrapping from the primary owner of 
the range and not a secondary replica? 

How is consistent bootstrap enforced when replacing a dead node? 

-
Thank you. 

On Apr 30, 2019, at 7:40 PM, Alok Dwivedi  wrote:
When a new node joins the ring, it needs to own new token ranges. These should 
be unique to the new node, and we don’t want to end up in a situation where two 
nodes joining simultaneously can own the same range (and ideally ranges stay evenly 
distributed). Cassandra has the 2-minute wait rule for gossip state to 
propagate before a node is added.  But this on its own does not guarantee that 
token ranges can’t overlap. See this ticket for more details: 
https://issues.apache.org/jira/browse/CASSANDRA-7069. To overcome this issue, 
the approach was to only allow one node to join at a time. 
 
When you replace a dead node, the new token range selection does not apply, as 
the replacing node just owns the token ranges of the dead node. I think that’s 
why the restriction of only bootstrapping one node at a time does not apply in 
this case. 
 
 
Thanks 
Alok Dwivedi
Senior Consultant
https://www.instaclustr.com/platform/
 
 
 
 
 
From: Fd Habash 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, 1 May 2019 at 06:18
To: "user@cassandra.apache.org" 
Subject: Bootstrapping to Replace a Dead Node vs. Adding a New Node: 
Consistency Guarantees 
 
Reviewing the documentation &  based on my testing, using C* 2.2.8, I was not 
able to extend the cluster by adding multiple nodes simultaneously. I got an 
error message …
 
Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while 
cassandra.consistent.rangemovement is true
 
I understand this is to force a node to bootstrap from the former owner of the 
range when adding a node as part of extending the cluster.
 
However, I was able to bootstrap multiple nodes to replace dead nodes. C* did 
not complain about it.

RE: Bootstrapping to Replace a Dead Node vs. Adding a New Node: Consistency Guarantees

2019-05-01 Thread Fd Habash
Probably, I needed to be clearer in my inquiry ….

I’m investigating a situation where our diagnostic data is telling us that C* 
has lost some of the application data. I mean, getsstables for the data returns 
zero on all nodes in all racks. 

The Last Pickle article below & Jeff Jirsa had described a situation where 
bootstrapping a node to extend the cluster can lose data if this new node 
bootstraps from a stale SECONDARY replica (a node that was offline > hinted 
hand-off window). This was fixed in CASSANDRA-2434. 
http://thelastpickle.com/blog/2017/05/23/auto-bootstrapping-part1.html

The article & the Jira above describe bootstrapping when extending a cluster.

I understand replacing a dead node does not involve range movement, but will 
the above Jira fix prevent the bootstrap process, when replacing a dead node, 
from using a secondary replica?

Thanks 


Thank you

From: Fred Habash
Sent: Wednesday, May 1, 2019 6:50 AM
To: user@cassandra.apache.org
Subject: Re: Bootstrapping to Replace a Dead Node vs. Adding a New 
Node: Consistency Guarantees

Thank you. 

Range movement is one reason this is enforced when adding a new node. But what 
about forcing a consistent bootstrap, i.e. bootstrapping from the primary owner of 
the range and not a secondary replica? 

How is consistent bootstrap enforced when replacing a dead node? 

-
Thank you. 

On Apr 30, 2019, at 7:40 PM, Alok Dwivedi  wrote:
When a new node joins the ring, it needs to own new token ranges. These should 
be unique to the new node, and we don’t want to end up in a situation where two 
nodes joining simultaneously can own the same range (and ideally ranges stay evenly 
distributed). Cassandra has the 2-minute wait rule for gossip state to 
propagate before a node is added.  But this on its own does not guarantee that 
token ranges can’t overlap. See this ticket for more details: 
https://issues.apache.org/jira/browse/CASSANDRA-7069. To overcome this issue, 
the approach was to only allow one node to join at a time. 
 
When you replace a dead node, the new token range selection does not apply, as 
the replacing node just owns the token ranges of the dead node. I think that’s 
why the restriction of only bootstrapping one node at a time does not apply in 
this case. 
 
 
Thanks 
Alok Dwivedi
Senior Consultant
https://www.instaclustr.com/platform/
 
 
 
 
 
From: Fd Habash 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, 1 May 2019 at 06:18
To: "user@cassandra.apache.org" 
Subject: Bootstrapping to Replace a Dead Node vs. Adding a New Node: 
Consistency Guarantees 
 
Reviewing the documentation &  based on my testing, using C* 2.2.8, I was not 
able to extend the cluster by adding multiple nodes simultaneously. I got an 
error message …
 
Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while 
cassandra.consistent.rangemovement is true
 
I understand this is to force a node to bootstrap from the former owner of the 
range when adding a node as part of extending the cluster.
 
However, I was able to bootstrap multiple nodes to replace dead nodes. C* did 
not complain about it.
 
Is consistent range movement & the guarantee it offers to bootstrap from 
primary range owner not applicable when bootstrapping to replace dead nodes? 
 

Thank you
 



Bootstrapping to Replace a Dead Node vs. Adding a New Node: Consistency Guarantees

2019-04-30 Thread Fd Habash
Reviewing the documentation &  based on my testing, using C* 2.2.8, I was not 
able to extend the cluster by adding multiple nodes simultaneously. I got an 
error message …

Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while 
cassandra.consistent.rangemovement is true

I understand this is to force a node to bootstrap from the former owner of the 
range when adding a node as part of extending the cluster.

However, I was able to bootstrap multiple nodes to replace dead nodes. C* did 
not complain about it.

Is consistent range movement & the guarantee it offers to bootstrap from 
primary range owner not applicable when bootstrapping to replace dead nodes? 


Thank you



RE: A keyspace with RF=3, Cluster with 3 RACS, CL=LQ: No Data on First Attempt, but 1 Row Afterwards

2019-04-23 Thread Fd Habash
Any ideas, please? 


Thank you

From: Fd Habash
Sent: Tuesday, April 23, 2019 10:38 AM
To: user@cassandra.apache.org
Subject: A keyspace with RF=3, Cluster with 3 RACS, CL=LQ: No Data on 
First Attempt, but 1 Row Afterwards

Cluster setup …
- C* 2.2.8
- Three RACs, one DC
- Keyspace with RF=3
- RS = Network topology 

At CL=LQ …

I get zero rows on first attempt, and one row on the second or third. Once 
found, I always get the row afterwards. 

Trying to understand this behavior …

On the first attempt, my read request hits a RAC that simply does not have the data. 
Subsequent attempts hit another RAC that has it, which triggers a read repair, 
causing the row to be returned consistently afterwards. Is this correct?

If a coordinator picks a node in the same RAC and the node does not have the 
data on disk, is it going to stop there and return nothing even though the row 
does exist on another RAC?

If anti-entropy repair has completed successfully on the entire cluster ‘repair 
-pr’, why is this still happening? 
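
One way to see which replicas the coordinator actually contacts (a sketch; table 
and key are placeholders, and it assumes cqlsh accepts the CONSISTENCY/TRACING 
directives via -e):

cqlsh -e "CONSISTENCY LOCAL_QUORUM; TRACING ON; SELECT * FROM my_ks.my_table WHERE pk = 'some_key';"
# the trace output should list each replica contacted, any digest mismatch, and any read repair triggered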



Thank you




A keyspace with RF=3, Cluster with 3 RACS, CL=LQ: No Data on First Attempt, but 1 Row Afterwards

2019-04-23 Thread Fd Habash
Cluster setup …
- C* 2.2.8
- Three RACs, one DC
- Keyspace with RF=3
- RS = Network topology 

At CL=LQ …

I get zero rows on first attempt, and one row on the second or third. Once 
found, I always get the row afterwards. 

Trying to understand this behavior …

On the first attempt, my read request hits a RAC that simply does not have the data. 
Subsequent attempts hit another RAC that has it, which triggers a read repair, 
causing the row to be returned consistently afterwards. Is this correct?

If a coordinator picks a node in the same RAC and the node does not have the 
data on disk, is it going to stop there and return nothing even though the row 
does exist on another RAC?

If anti-entropy repair has completed successfully on the entire cluster ‘repair 
-pr’, why is this still happening? 



Thank you



RE: Cannot replace_address /10.xx.xx.xx because it doesn't exist in gossip

2019-03-14 Thread Fd Habash
I can conclusively say, none of these commands were run. However, I think this 
is  the likely scenario …

If you have a cluster of three nodes 1,2,3 …
- If 3 shows as DN
- Restart C* on 1 & 2
- Nodetool status should NOT show node 3 IP at all.

Restarting the cluster while a node is down resets gossip state. 

There is a good chance this is what happened. 

Plausible? 
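
Before attempting the replace again, this is roughly how I check whether the node 
is still known to gossip (the IP is a placeholder):

nodetool status | grep 10.xx.xx.xx            # is it still listed as DN?
nodetool gossipinfo | grep -A 5 /10.xx.xx.xx  # does gossip still carry its endpoint state?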


Thank you

From: Jeff Jirsa
Sent: Thursday, March 14, 2019 11:06 AM
To: cassandra
Subject: Re: Cannot replace_address /10.xx.xx.xx because it doesn't exist 
in gossip

Two things that wouldn't be a bug:

You could have run removenode
You could have run assassinate

Also could be some new bug, but that's much less likely. 


On Thu, Mar 14, 2019 at 2:50 PM Fd Habash  wrote:
I have a node which I know for certain was a cluster member last week. It 
showed in nodetool status as DN. When I attempted to replace it today, I got 
this message 
 
ERROR [main] 2019-03-14 14:40:49,208 CassandraDaemon.java:654 - Exception 
encountered during startup
java.lang.RuntimeException: Cannot replace_address /10.xx.xx.xxx.xx because it 
doesn't exist in gossip
    at 
org.apache.cassandra.service.StorageService.prepareReplacementInfo(StorageService.java:449)
 ~[apache-cassandra-2.2.8.jar:2.2.8]
 
 
DN  10.xx.xx.xx  388.43 KB  256  6.9%  
bdbd632a-bf5d-44d4-b220-f17f258c4701  1e
 
Under what conditions does this happen?
 
 

Thank you
 



Cannot replace_address /10.xx.xx.xx because it doesn't exist in gossip

2019-03-14 Thread Fd Habash
I have a node which I know for certain was a cluster member last week. It 
showed in nodetool status as DN. When I attempted to replace it today, I got 
this message 

ERROR [main] 2019-03-14 14:40:49,208 CassandraDaemon.java:654 - Exception 
encountered during startup
java.lang.RuntimeException: Cannot replace_address /10.xx.xx.xxx.xx because it 
doesn't exist in gossip
at 
org.apache.cassandra.service.StorageService.prepareReplacementInfo(StorageService.java:449)
 ~[apache-cassandra-2.2.8.jar:2.2.8]


DN  10.xx.xx.xx  388.43 KB  256  6.9%  
bdbd632a-bf5d-44d4-b220-f17f258c4701  1e

Under what conditions does this happen?



Thank you



Loss of an Entire AZ in a Three-AZ Cassandra Cluster

2019-03-08 Thread Fd Habash
Assume you have a 30-node cluster distributed across three AZ’s with an RF of 
3. Trying to come up with a runbook to manage multi-node failures as a result 
of …

- Loss of an entire AZ1
- Loss of multiple nodes in AZ2
- AZ3 unaffected. No node loss

Is this the most optimal plan? Replacing dead nodes via bootstrapping  …

1. Replace seed nodes first (via bootstrap)
2. Bootstrap the few nodes in AZ2
3. Bootstrap all nodes in AZ1
4. Run a cluster repair.

Do you wait to bootstrap everything before running repair or do you repair per 
node?
Did I miss anything? 
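
For each dead node, the replacement step itself would be roughly as follows (a 
sketch; the IP is a placeholder for the dead node's address):

# on the replacement node, before first start (e.g. in cassandra-env.sh)
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.0.5"

# start C*, wait until the node shows UN in 'nodetool status', then repair it
nodetool repair -full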


Thank you



Migrating to Reaper: Switching From Incremental to Reaper's Full Subrange Repair

2018-06-13 Thread Fd Habash
For those who are using Reaper …

Currently, I’m running repairs via crontab/nodetool using ‘repair -pr’ on 2.2.8, 
which defaults to incremental. If I migrate to Reaper, do I have to mark 
sstables as unrepaired first? Also, out of the box, does Reaper run full 
parallel repair? If yes, is it not going to cause over-streaming since we are 
repairing ranges multiple times?
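
My (unverified) understanding is that the repaired flag can be inspected and 
cleared per sstable with the bundled sstablemetadata / sstablerepairedset tools 
while the node is stopped, roughly like this (paths are placeholders):

# with C* stopped on the node
sstablemetadata /data/my_ks/my_table-*/lb-*-big-Data.db | grep "Repaired at"
sstablerepairedset --really-set --is-unrepaired /data/my_ks/my_table-*/lb-*-big-Data.db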


Thank you



RE: Read Latency Doubles After Shrinking Cluster and Never Recovers

2018-06-11 Thread Fd Habash
I will check for both.

On a different subject, I have read some user testimonies that running 
‘nodetool cleanup’ requires a C* process reboot at least around 2.2.8. Is this 
true?



Thank you

From: Nitan Kainth
Sent: Monday, June 11, 2018 10:40 AM
To: user@cassandra.apache.org
Subject: Re: Read Latency Doubles After Shrinking Cluster and Never Recovers

I think it would, because Cassandra will process more sstables to create a 
response to read queries.

Now, after cleanup, if the data volume is the same and compaction has been running, 
I can’t think of any more diagnostic steps. Let’s wait for other experts to 
comment.

Can you also check sstable count for each table just to be sure that they are 
not extraordinarily high?
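
For example, something along these lines per table (keyspace/table names are 
placeholders):

nodetool cfstats my_ks.my_table | grep "SSTable count"
nodetool cfhistograms my_ks my_table    # SSTables-per-read percentiles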
Sent from my iPhone

On Jun 11, 2018, at 10:21 AM, Fd Habash  wrote:
Yes we did after adding the three nodes back and a full cluster repair as well. 
 
But even if we didn’t run cleanup, would the fact that some nodes still have 
sstables they no longer need have impacted read latency? 
 
Thanks 
 

Thank you
 
From: Nitan Kainth
Sent: Monday, June 11, 2018 10:18 AM
To: user@cassandra.apache.org
Subject: Re: Read Latency Doubles After Shrinking Cluster and Never Recovers
 
Did you run cleanup too? 
 
On Mon, Jun 11, 2018 at 10:16 AM, Fred Habash  wrote:
I have hit dead-ends everywhere I turned on this issue. 
 
We had a 15-node cluster  that was doing 35 ms all along for years. At some 
point, we made a decision to shrink it to 13. Read latency rose to near 70 ms. 
Shortly after, we decided this was not acceptable, so we added the three nodes 
back in. Read latency dropped to near 50 ms and it has been hovering around 
this value for over 6 months now.
 
Repairs run regularly, load on cluster nodes is even,  application activity 
profile has not changed. 
 
Why are we unable to get back the same read latency now that the cluster is 15 
nodes again, the same size it was before?
 
-- 
 

Thank you


 
 



RE: Read Latency Doubles After Shrinking Cluster and Never Recovers

2018-06-11 Thread Fd Habash
Yes we did after adding the three nodes back and a full cluster repair as well. 

But even if we didn’t run cleanup, would the fact that some nodes still have 
sstables they no longer need have impacted read latency? 

Thanks 


Thank you

From: Nitan Kainth
Sent: Monday, June 11, 2018 10:18 AM
To: user@cassandra.apache.org
Subject: Re: Read Latency Doubles After Shrinking Cluster and Never Recovers

Did you run cleanup too? 

On Mon, Jun 11, 2018 at 10:16 AM, Fred Habash  wrote:
I have hit dead-ends everywhere I turned on this issue. 

We had a 15-node cluster  that was doing 35 ms all along for years. At some 
point, we made a decision to shrink it to 13. Read latency rose to near 70 ms. 
Shortly after, we decided this was not acceptable, so we added the three nodes 
back in. Read latency dropped to near 50 ms and it has been hovering around 
this value for over 6 months now.

Repairs run regularly, load on cluster nodes is even,  application activity 
profile has not changed. 

Why are we unable to get back the same read latency now that the cluster is 15 
nodes again, the same size it was before?

-- 


Thank you





RE: Cassandra upgrade from 2.2.8 to 3.10

2018-03-28 Thread Fd Habash
Thank you.

In regards to my second inquiry, as we plan for C* upgrades, I did not find 
NEWS.txt to always be telling of possible upgrade paths. Is there a rule of 
thumb or maybe an official reference for upgrade paths?




Thank you

From: Alexander Dejanovski
Sent: Wednesday, March 28, 2018 1:58 PM
To: user@cassandra.apache.org
Subject: Re: Cassandra upgrade from 2.2.8 to 3.10

You can perform an upgrade from 2.2.x straight to 3.11.2, but the op suggests 
adding nodes in 3.10 to a cluster that runs 2.2.8, which is why Jeff says it 
won't work.

I see no reason to upgrade to 3.10 and not 3.11.2 by the way.

On Wed, Mar 28, 2018 at 5:10 PM Fred Habash  wrote:
Hi ...
I'm finding anecdotal evidence on the internet that we are able to upgrade 
2.2.8 to latest 3.11.2. Post below indicates that you can upgrade to latest 3.x 
from 2.1.9 because 3.x no longer requires 'structured upgrade path'.

I just want to confirm that such upgrade is supported. If yes, where can I find 
official documentation showing upgrade path across releases.

https://stackoverflow.com/questions/42094935/apache-cassandra-upgrade-3-x-from-2-1
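
Assuming the direct 2.2.x -> 3.11.x path is indeed supported, the per-node rolling 
procedure I have in mind is roughly the following (a sketch, not an authoritative 
checklist; NEWS.txt for the target version should be read first):

nodetool drain              # flush memtables and stop accepting writes
# stop C*, install the 3.11.x binaries, reconcile cassandra.yaml changes, start C*
nodetool upgradesstables    # rewrite sstables into the new on-disk format
# verify 'nodetool status' and the logs before moving on to the next node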

Thanks 

On Mon, Aug 7, 2017 at 5:58 PM, ZAIDI, ASAD A  wrote:
Hi folks, I have a question on the upgrade method I’m thinking to execute.
 
I’m planning an upgrade from Apache Cassandra 2.2.8 to release 3.10.
 
My Cassandra cluster is configured as one rack with two datacenters:
 
1.   DC1 has 4 nodes 
2.   DC2 has 16 nodes
 
We’re adding another 12 nodes and would eventually need to remove those 4 nodes 
in DC1.
 
I’m thinking to add a third data center, DC3, with 12 nodes running Apache 
Cassandra 3.10. Then I would start upgrading seed nodes first in DC1 & DC2 – 
once all 20 nodes in DC1 plus DC2 are upgraded, I can safely remove the 4 DC1 
nodes. Can you please let me know if this approach would work? I’m concerned 
that having mixed versions of Cassandra nodes may cause issues, e.g. in 
streaming data/sstables from the existing DCs to the newly created third DC 
running 3.10 – will nodes in DC3 join the cluster with data without issues?
 
Thanks/Asad
 
 
 
 




-- 


Thank you ...

Fred Habash, Database Solutions Architect (Oracle OCP 8i,9i,10g,11g)

-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com



RE: On a 12-node Cluster, Starting C* on a Seed Node Increases Read Latency from 150ms to 1.5 sec.

2018-03-02 Thread Fd Habash
I understand you use Apache Cassandra 2.2.8. :)
- Yes. It was a typo

In Apache Cassandra 2.2.8, this triggers incremental repairs I believe,
- Yes, the default as of 2.2, and we use primary-range repair, which runs on every 
node in the cluster

. Did you replace the node in-place?
- Yes. We removed it from its own seed provider list. Otherwise, it won’t bootstrap. 

You should be able to have nodes going down, or being fairly slow …
- When we stopped C* on this node, read performance recovered well. Once 
started, and now with no repairs running at all, latency increased badly to over 
1.5 secs. This affected the node (in AZ 1) and the other 8 nodes (4 in AZ 2 
and 4 in AZ 3). That is, it slowed down the other 2 AZ’s. 
- The application reads with CL=LQ
- This behavior I do not understand. There is no streaming.

My coworker Alexander wrote about this a few months ago, i…
- We have been looking into Reaper for the past 2 months. Work in progress. 

And thank you for the thorough response. 


From: Alain RODRIGUEZ
Sent: Friday, March 2, 2018 11:43 AM
To: user cassandra.apache.org
Subject: Re: On a 12-node Cluster, Starting C* on a Seed Node Increases 
Read Latency from 150ms to 1.5 sec.

Hello,

This is a 2.8.8. cluster

That's an exotic version!

I understand you use Apache Cassandra 2.2.8. :)

This single node was a seed node and it was running a ‘repair -pr’ at the time 

In Apache Cassandra 2.2.8, this triggers incremental repairs I believe, and 
they are relatively (some would say completely) broken. Let's say they caused a 
lot of trouble in many cases. If I am wrong and you are not running 
incremental repairs (the default in your version, off the top of my head), then your 
node might not have enough resources available to handle both the repair and the 
standard load. It might be something to check.

Consequences of incremental repairs are:

- Keeping SSTables split between repaired and unrepaired sets, increasing 
the number of SSTables
- Anti-compaction (splits SSTables) is used to keep them grouped.

This induces a lot of performance downsides such as (but not only):

- inefficient tombstone eviction
- More disk hits for the same queries
- More compaction work

Machines are then performing very poorly.

My coworker Alexander wrote about this a few month ago, it might be of 
interest: 
http://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html
If repairs are a pain point, you might be interested in checking 
http://cassandra-reaper.io/, that aims at making this operation easier and more 
efficient.

I would say the fact this node is a seed nodes did not impact here, it is a 
coincidence due to the fact you picked a seed for the repair. Seed nodes are 
mostly working as any other node, excepted during bootstrap.

So we decided to bootstrap it.

I am not sure what happens when bootstrapping a seed node. I always removed it 
from the seed list first. Did you replace the node in-place? I guess if you had 
no warnings and have no consistency issues, it's all good.

All we were able to see is that the seed node in question was different in that 
it had 5000 sstables while all others had around 2300. After bootstrap, seed 
node sstables reduced to 2500.

I would say this is fairly common (even more when using vnodes) as streaming of 
the data from all the other nodes is fast and compaction might take a while to 
catch up.

Why would starting C* on a single seed node affect the cluster this bad? 

That's a fair question. It depends on factors such as the client configuration, 
the replication factor, the consistency level used. If the node is involved in 
some reads, then the average latency will drop.

You should be able to have nodes going down, or being fairly slow, and still use the 
right nodes, if the client is recent enough and well configured.
 
Is it gossip?

It might be; there were issues, but I believe in previous versions and/or on 
bigger clusters. I would dig for a 'repair' issue first, it seems more probable 
to me. 

I hope this helped,

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-03-02 14:42 GMT+00:00 Fd Habash <fmhab...@gmail.com>:
This is a 2.8.8. cluster with three AWS AZs, each with 4 nodes.
 
A few days ago we noticed a single node’s read latency reaching 1.5 secs; there 
were 8 others with read latencies going up near 900 ms. 
 
This single node was a seed node and it was running a ‘repair -pr’ at the time. 
We intervened as follows …
 
• Stopping compactions during repair did not improve latency.
• Killing repair brought down latency to 200 ms on the seed node and the other 
8.
• Restarting C* on the seed node increased latency again back to near 1.5 secs 
on the seed and other 8. At this point, there was no repair running and 
compactions were running. We left them alone. 
 
At this point, we saw that putting the seed node back in the cluster 
consistently worsened latencies on the seed and the other 8 nodes = 9 out of the 
12 nodes in the cluster.

On a 12-node Cluster, Starting C* on a Seed Node Increases Read Latency from 150ms to 1.5 sec.

2018-03-02 Thread Fd Habash
This is a 2.8.8. cluster with three AWS AZs, each with 4 nodes.

A few days ago we noticed a single node’s read latency reaching 1.5 secs; there 
were 8 others with read latencies going up near 900 ms. 

This single node was a seed node and it was running a ‘repair -pr’ at the time. 
We intervened as follows …

• Stopping compactions during repair did not improve latency.
• Killing repair brought down latency to 200 ms on the seed node and the other 
8.
• Restarting C* on the seed node increased latency again back to near 1.5 secs 
on the seed and other 8. At this point, there was no repair running and 
compactions were running. We left them alone. 

At this point, we saw that putting the seed node back in the cluster 
consistently worsened latencies on seed and 8 nodes = 9 out of the 12 nodes in 
the cluster. 

So we decided to bootstrap it. During the bootstrapping and afterwards, 
latencies remained near 200 ms which is what we wanted for now. 

All we were able to see is that the seed node in question was different in that 
it had 5000 sstables while all others had around 2300. After bootstrap, seed 
node sstables reduced to 2500.

Why would starting C* on a single seed node affect the cluster this bad? Again, 
no repair, just 4 compactions that run routinely on it as well as on all others. Is it 
gossip?  What other plausible explanations are there?


Thank you



RE: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster

2018-02-22 Thread Fd Habash
One more observation …

When we compare read latencies between the non-prod cluster (where nodes were 
removed) and the prod cluster, even though the node load as measured by the size 
of the /data dir is similar, the read latencies are 5 times slower in the 
downsized non-prod cluster.

The only difference we see is that prod reads from 4 sstables whereas non-prod 
reads from 5, per cfhistograms. 

Non-prod /data size
-
Filesystem    Size  Used  Avail  Use%  Mounted on
/dev/nvme0n1  885G  454G  432G   52%   /data
/dev/nvme0n1  885G  439G  446G   50%   /data
/dev/nvme0n1  885G  368G  518G   42%   /data
/dev/nvme0n1  885G  431G  455G   49%   /data
/dev/nvme0n1  885G  463G  423G   53%   /data
/dev/nvme0n1  885G  406G  479G   46%   /data
/dev/nvme0n1  885G  419G  466G   48%   /data

Prod /data size

Filesystem    Size  Used  Avail  Use%  Mounted on
/dev/nvme0n1  885G  352G  534G   40%   /data
/dev/nvme0n1  885G  423G  462G   48%   /data
/dev/nvme0n1  885G  431G  454G   49%   /data
/dev/nvme0n1  885G  442G  443G   50%   /data
/dev/nvme0n1  885G  454G  431G   52%   /data


Cfhistograms: comparing prod to non-prod
-

Non-prod
--
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                      (micros)       (micros)      (bytes)
50%             1.00          24.60       4055.27           11864           4
75%             2.00          35.43      14530.76           17084           4
95%             4.00         126.93      89970.66           35425           4
98%             5.00         219.34     155469.30           73457           4
99%             5.00         219.34     186563.16          105778           4
Min             0.00           5.72         17.09              87           3
Max             7.00       20924.30    1386179.89        14530764           4

Prod
--- 
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                      (micros)       (micros)      (bytes)
50%             1.00          24.60       2346.80           11864           4
75%             2.00          29.52       4866.32           17084           4
95%             3.00          73.46      14530.76           29521           4
98%             4.00         182.79      25109.16           61214           4
99%             4.00         182.79      36157.19           88148           4
Min             0.00           9.89         20.50              87           0
Max             5.00         219.34     155469.30        12108970           4



Thank you

From: Fd Habash
Sent: Thursday, February 22, 2018 9:00 AM
To: user@cassandra.apache.org
Subject: RE: Cluster Repairs ‘nodetool repair -pr’ Cause Severe Increase in Read 
Latency After Shrinking Cluster


“ data was allowed to fully rebalance/repair/drain before the next node was 
taken off?”
--
Judging by the messages, the decomm was healthy. As an example 

  StorageService.java:3425 - Announcing that I have left the ring for 3ms   
…
INFO  [RMI TCP Connection(4)-127.0.0.1] 2016-01-07 06:00:52,662 
StorageService.java:1191 – DECOMMISSIONED

I do not believe repairs were run after each node removal. I’ll double-check. 

I’m not sure what you mean by ‘rebalance’? How do you check if a node is 
balanced? Load/size of data dir? 

As for the drain, there was no need to drain and I believe it is not something 
you do as part of decomm’ing a node. 

did you take 1 off per rack/AZ?
--
We removed 3 nodes, one from each AZ in sequence

These are some of the cfhistogram metrics. Read latencies are high after the 
removal of the nodes.

RE: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster

2018-02-22 Thread Fd Habash

“ data was allowed to fully rebalance/repair/drain before the next node was 
taken off?”
--
Judging by the messages, the decomm was healthy. As an example 

  StorageService.java:3425 - Announcing that I have left the ring for 3ms   
 
…
INFO  [RMI TCP Connection(4)-127.0.0.1] 2016-01-07 06:00:52,662 
StorageService.java:1191 – DECOMMISSIONED

I do not believe repairs were run after each node removal. I’ll double-check. 

I’m not sure what you mean by ‘rebalance’? How do you check if a node is 
balanced? Load/size of data dir? 

As for the drain, there was no need to drain and I believe it is not something 
you do as part of decomm’ing a node. 

did you take 1 off per rack/AZ?
--
We removed 3 nodes, one from each AZ in sequence

These are some of the cfhistogram metrics. Read latencies are high after the 
removal of the nodes
--
You can see reads of 186 ms at the 99th percentile from 5 sstables. These are 
awfully high numbers given that these metrics measure C* storage-layer read 
performance. 

Does this mean removing the nodes undersized the cluster? 

key_space_01/cf_01 histograms
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                      (micros)       (micros)      (bytes)
50%             1.00          24.60       4055.27           11864           4
75%             2.00          35.43      14530.76           17084           4
95%             4.00         126.93      89970.66           35425           4
98%             5.00         219.34     155469.30           73457           4
99%             5.00         219.34     186563.16          105778           4
Min             0.00           5.72         17.09              87           3
Max             7.00       20924.30    1386179.89        14530764           4

key_space_01/cf_01 histograms
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                      (micros)       (micros)      (bytes)
50%             1.00          29.52       4055.27           11864           4
75%             2.00          42.51      10090.81           17084           4
95%             4.00         152.32      52066.35           35425           4
98%             4.00         219.34      89970.66           73457           4
99%             5.00         219.34     155469.30           88148           4
Min             0.00           9.89         24.60              87           0
Max             6.00        1955.67     557074.61        14530764           4
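
To rule out a compaction backlog on the downsized nodes (ranges streamed in from 
the decommissioned nodes still waiting to be compacted), I am also checking the 
following on each node:

nodetool compactionstats                                   # pending compaction tasks
nodetool cfstats key_space_01.cf_01 | grep "SSTable count"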


Thank you

From: Carl Mueller
Sent: Wednesday, February 21, 2018 4:33 PM
To: user@cassandra.apache.org
Subject: Re: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read 
Latency After Shrinking Cluster

Hm, nodetool decommission performs the stream-out of the replicated data, and you 
said that was apparently without error...

But if you dropped three nodes in one AZ/rack on a five node with RF3, then we 
have a missing RF factor unless NetworkTopologyStrategy fails over to another 
AZ. But that would also entail cross-az streaming and queries and repair.

On Wed, Feb 21, 2018 at 3:30 PM, Carl Mueller <carl.muel...@smartthings.com> 
wrote:
sorry for the idiot questions... 

data was allowed to fully rebalance/repair/drain before the next node was taken 
off?

did you take 1 off per rack/AZ?


On Wed, Feb 21, 2018 at 12:29 PM, Fred Habash <fmhab...@gmail.com> wrote:
One node at a time 

On Feb 21, 2018 10:23 AM, "Carl Mueller" <carl.muel...@smartthings.com> wrote:
What is your replication factor? 
Single datacenter, three availability zones, is that right?
You removed one node at a time or three at once?

On Wed, Feb 21, 2018 at 10:20 AM, Fd Habash <fmhab...@gmail.com> wrote:
We have had a 15 node cluster across three zones and cluster repairs using 
‘nodetool repair -pr’ took about 3 hours to finish. Lately, we shrunk the 
cluster to 12. Since then, same repair job has taken up to 12 hours to finish 
and most times, it never does. 
 
More importantly, at some point during the repair cycle, we see read latencies 
jumping to 1-2 seconds and applications immediately notice the impact.
 
stream_throughput_outbound_megabits_per_sec is set at 200 and 
compaction_throughput_mb_per_sec at 64. The /data dir on the nodes is around 
~500GB at 44% usage. 
 
When shrinking the cluster, the ‘nodetool decommission’ was uneventful. It 
completed successfully with no issues.
 
What could possibly cause repairs to cause this impact following cluster 
downsizing?

Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster

2018-02-21 Thread Fd Habash
We have had a 15 node cluster across three zones and cluster repairs using 
‘nodetool repair -pr’ took about 3 hours to finish. Lately, we shrunk the 
cluster to 12. Since then, same repair job has taken up to 12 hours to finish 
and most times, it never does. 

More importantly, at some point during the repair cycle, we see read latencies 
jumping to 1-2 seconds and applications immediately notice the impact.

stream_throughput_outbound_megabits_per_sec is set at 200 and 
compaction_throughput_mb_per_sec at 64. The /data dir on the nodes is around 
~500GB at 44% usage. 

When shrinking the cluster, the ‘nodetool decommission’ was uneventful. It 
completed successfully with no issues.

What could possibly cause repairs to cause this impact following cluster 
downsizing? Taking three nodes out does not seem compatible with such a drastic 
effect on repair and read latency. 

Any expert insights will be appreciated. 

Thank you



RE: When Replacing a Node, How to Force a Consistent Bootstrap

2017-12-14 Thread Fd Habash
“ … but it's better to repair before and after if possible …”

Afterwards, I simply run ‘nodetool repair -full’ on the replaced node. But before 
bootstrapping, if my cluster is distributed over 3 AZ’s, what do I repair? The 
entire other AZ’s? As someone pointed out earlier, I can use ‘nodetool repair 
-hosts’; how do you identify which specific hosts to repair?
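
In case it helps frame the question, what I think might work (not verified) is to 
read the ring's token-range ownership and then repair only against the surviving 
replicas, along these lines (keyspace and hosts are placeholders, and I would 
double-check the exact -hosts syntax in 'nodetool help repair'):

# list token ranges and their endpoints; ranges owned by the dead node show its neighbors
nodetool describering my_ks

# then, per Jeff's earlier suggestion, repair with only the surviving replicas
nodetool repair -full -hosts surviving_replica_1,surviving_replica_2 my_ks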

Thanks 


Thank you

From: Fd Habash
Sent: Thursday, December 7, 2017 12:09 PM
To: user@cassandra.apache.org
Subject: RE: When Replacing a Node, How to Force a Consistent Bootstrap

Thank you.

How do I identify what other 2 nodes the former downed node replicated with? A 
replica set of 3 nodes A,B,C. Now, C has been terminated by AWS and is gone. 
Using the getendpoints assumes knowing a partition key value, but how do you 
even know what key to use?

If there is a way to identify A and B, I can then simply run ‘nodetool 
repair’ to repair ALL the ranges on either.

Thanks 


Thank you

From: kurt greaves
Sent: Wednesday, December 6, 2017 6:45 PM
To: User
Subject: Re: When Replacing a Node, How to Force a Consistent Bootstrap

That's also an option but it's better to repair before and after if possible, 
if you don't repair beforehand you could end up missing some replicas until you 
repair after replacement, which could cause queries to return old/no data. 
Alternatively you could use ALL after replacing until the repair completes.

For example, A and C have replica a, A dies, on replace A streams the partition 
owning a from B, and thus is still inconsistent. QUORUM query hits A and B, and 
no results are returned for a.

On 5 December 2017 at 23:04, Fred Habash <fmhab...@gmail.com> wrote:
Or, do a full repair after bootstrapping completes?



On Dec 5, 2017 4:43 PM, "Jeff Jirsa" <jji...@gmail.com> wrote:
You can't ask Cassandra to stream from the node with the "most recent data", 
because for some rows B may be most recent, and for others C may be most recent 
- you'd have to stream from both (which we don't support).

You'll need to repair (and you can repair before you do the replace to avoid 
the window of time where you violate consistency - use the -hosts option to 
allow repair with a down host, you'll repair A+C, so when B starts it'll 
definitely have all of the data).


On Tue, Dec 5, 2017 at 1:38 PM, Fd Habash <fmhab...@gmail.com> wrote:
Assume I have cluster of 3 nodes (A,B,C). Row x was written with CL=LQ to node 
A and B. Before it was written to C, node B crashes. I replaced B and it 
bootstrapped data from node C.
 
Now, row x is missing from C and B.  If node A crashes, it will be replaced and 
it will bootstrap from either C or B. As such, row x is now completely gone 
from the entire ring. 
 
Is this scenario possible at all (at least in C* < 3.0). 
 
How can a newly replaced node be forced to bootstrap from the node in the 
replica set that has the most recent data? 
 
Otherwise, we have to repair a node immediately after bootstrapping it for a 
node replacement.
 
Thank you
 






RE: When Replacing a Node, How to Force a Consistent Bootstrap

2017-12-07 Thread Fd Habash
Thank you.

How do I identify what other 2 nodes the former downed node replicated with? A 
replica set of 3 nodes A,B,C. Now, C has been terminated by AWS and is gone. 
Using the getendpoints assumes knowing a partition key value, but how do you 
even know what key to use?

If there is a way to identify A and B, I can then simply run ‘nodetool 
repair’ to repair ALL the ranges on either.

Thanks 


Thank you

From: kurt greaves
Sent: Wednesday, December 6, 2017 6:45 PM
To: User
Subject: Re: When Replacing a Node, How to Force a Consistent Bootstrap

That's also an option but it's better to repair before and after if possible, 
if you don't repair beforehand you could end up missing some replicas until you 
repair after replacement, which could cause queries to return old/no data. 
Alternatively you could use ALL after replacing until the repair completes.

For example, A and C have replica a, A dies, on replace A streams the partition 
owning a from B, and thus is still inconsistent. QUORUM query hits A and B, and 
no results are returned for a.

On 5 December 2017 at 23:04, Fred Habash <fmhab...@gmail.com> wrote:
Or, do a full repair after bootstrapping completes?



On Dec 5, 2017 4:43 PM, "Jeff Jirsa" <jji...@gmail.com> wrote:
You can't ask Cassandra to stream from the node with the "most recent data", 
because for some rows B may be most recent, and for others C may be most recent 
- you'd have to stream from both (which we don't support).

You'll need to repair (and you can repair before you do the replace to avoid 
the window of time where you violate consistency - use the -hosts option to 
allow repair with a down host, you'll repair A+C, so when B starts it'll 
definitely have all of the data).


On Tue, Dec 5, 2017 at 1:38 PM, Fd Habash <fmhab...@gmail.com> wrote:
Assume I have cluster of 3 nodes (A,B,C). Row x was written with CL=LQ to node 
A and B. Before it was written to C, node B crashes. I replaced B and it 
bootstrapped data from node C.
 
Now, row x is missing from C and B.  If node A crashes, it will be replaced and 
it will bootstrap from either C or B. As such, row x is now completely gone 
from the entire ring. 
 
Is this scenario possible at all (at least in C* < 3.0). 
 
How can a newly replaced node be forced to bootstrap from the node in the 
replica set that has the most recent data? 
 
Otherwise, we have to repair a node immediately after bootstrapping it for a 
node replacement.
 
Thank you
 





When Replacing a Node, How to Force a Consistent Bootstrap

2017-12-05 Thread Fd Habash
Assume I have cluster of 3 nodes (A,B,C). Row x was written with CL=LQ to node 
A and B. Before it was written to C, node B crashes. I replaced B and it 
bootstrapped data from node C.

Now, row x is missing from C and B.  If node A crashes, it will be replaced and 
it will bootstrap from either C or B. As such, row x is now completely gone 
from the entire ring. 

Is this scenario possible at all (at least in C* < 3.0). 

How can a newly replaced node be forced to bootstrap from the node in the 
replica set that has the most recent data? 

Otherwise, we have to repair a node immediately after bootstrapping it for a 
node replacement.

Thank you



Replacing a Seed Node

2017-08-03 Thread Fd Habash
Hi all …
I know there are plenty of docs on how to replace a seed node, but some steps 
are contradictory, e.g. the need to remove the node from the seed list for the 
entire cluster. 

My cluster has 6 nodes with 3 seeds running C* 2.8. One seed node was 
terminated by AWS. 

I came up with this procedure. Did I miss anything …

1) Remove the node (decomm or removenode) based on its current status
2) Remove the node from its own seed list
a. No need to remove it from other nodes. My cluster has 3 seeds
3) Restart C* with auto_bootstrap = true
4) Once autobootstrap is done, re-add the node as seed in its own 
Cassandra.yaml again
5) Restart C* on this node
6) No need to restart other nodes in the cluster



Thank you



Sync Spark Data with Cassandra Using Incremental Data Loading

2017-07-19 Thread Fd Habash
I have a scenario where data has to be loaded into Spark nodes from two data 
stores: Oracle and Cassandra. We did the initial loading of data and found a 
way to do daily incremental loading from Oracle to Spark. 

I’m trying to figure out how to do this from C*. What tools are available in C* 
to do incremental backup/restore/load?

Thanks 


Constant MemtableFlushWriter Messages Following upgrade from 2.2.5 to 2.2.8

2017-04-12 Thread Fd Habash
We are in the process of upgrading our cluster. Nodes that got upgraded are 
constantly emitting these messages. No impact, but I wanted to know what they 
mean and why they appear only after the upgrade.

Any feedback will be appreciated. 


17-04-10 20:18:11,580 Memtable.java:352 - Writing 
Memtable-compactions_in_progress@748675126(0.008KiB serialized bytes, 1 ops, 
0%/0% 
of on/off-heap limit)
INFO  [MemtableFlushWriter:1] 2017-04-10 20:18:11,588 Memtable.java:352 - 
Writing Memtable-compactions_in_progress@1129449190(0.195KiB serialized bytes, 
12 ops, 0%/0
% of on/off-heap limit)
INFO  [MemtableFlushWriter:2] 2017-04-10 20:18:14,426 Memtable.java:352 - 
Writing Memtable-compactions_in_progress@931709037(0.008KiB serialized bytes, 1 
ops, 0%/0% 
of on/off-heap limit)
INFO  [MemtableFlushWriter:1] 2017-04-10 20:18:44,950 Memtable.java:352 - 
Writing Memtable-compactions_in_progress@1057180976(0.008KiB serialized bytes, 
1 ops, 0%/0%
 of on/off-heap limit)
INFO  [MemtableFlushWriter:2] 2017-04-10 20:18:44,963 Memtable.java:352 - 
Writing Memtable-compactions_in_progress@2110307908(0.195KiB serialized bytes, 
12 ops, 0%/0
% of on/off-heap limit)
INFO  [MemtableFlushWriter:1] 2017-04-10 20:18:45,546 Memtable.java:352 - 
Writing Memtable-compactions_in_progress@1803704247(0.008KiB serialized bytes, 
1 ops, 0%/0%
 of on/off-heap limit)
INFO  [MemtableFlushWriter:2] 2017-04-10 20:19:16,196 Memtable.java:352 - 
Writing Memtable-compactions_in_progress@1692030234(0.008KiB serialized bytes, 
1 ops, 0%/0%
 of on/off-heap limit)
INFO  [MemtableFlushWriter:1] 2017-04-10 20:19:16,240 Memtable.java:352 - 
Writing Memtable-compactions_in_progress@12532575(0.098KiB serialized bytes, 6 
ops, 0%/0% o
f on/off-heap limit)
INFO  [MemtableFlushWriter:2] 2017-04-10 20:19:16,241 Memtable.java:352 - 
Writing Memtable-compactions_in_progress@337283565(0.098KiB serialized bytes, 6 
ops, 0%/0% 
of on/off-heap limit)
INFO  [MemtableFlushWriter:1] 2017-04-10 20:19:52,322 Memtable.java:352 - 
Writing Memtable-compactions_in_progress@810846450(0.008KiB serialized bytes, 1 
ops, 0%/0% 
of on/off-heap limit)
INFO  [MemtableFlushWriter:2] 2017-04-10 20:19:52,561 Memtable.java:352 - 
Writing Memtable-compactions_in_progress@2010893318(0.008KiB serialized bytes, 
1 ops, 0%/0%
 of on/off-heap limit)


Thank you