Running Node Repair After Changing RF or Replication Strategy for a Keyspace
Hi all … The DataStax & Apache docs are clear: run ‘nodetool repair’ after you alter a keyspace to change its RF or replication strategy. However, the details are all over the place as to what type of repair to run and on which nodes it needs to run. Neither doc authority is clear, and what you find on the internet is quite contradictory. For example, one IBM doc suggests running both the ‘alter keyspace’ and the repair on EACH node affected, or on ‘each node you need to change the RF on’. Others suggest running ‘repair -pr’. On a cluster of 1 DC and three racks, this is how I understand it (sketched below) …
1. Run the ‘alter keyspace’ on a SINGLE node.
2. As for repairing the altered keyspace, I assume there are two options …
a. Run ‘repair -full [keyspace]’ on all nodes in all racks
b. Run ‘repair -pr -full [keyspace]’ on all nodes in all racks
Does this sound correct? Thank you
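For concreteness, a minimal sketch of my understanding, assuming a hypothetical keyspace ‘my_ks’ and a datacenter named ‘dc1’ (both placeholders):

# The ALTER is a schema change, so it runs once, on any single node,
# and propagates cluster-wide:
cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};"

# Option (a): full repair of the keyspace, run serially on EVERY node:
nodetool repair -full my_ks

# Option (b): primary-range repair; also valid, but only if run on EVERY
# node, since -pr repairs just the ranges each node owns as primary:
nodetool repair -pr -full my_ks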
Is There a Way To Proactively Monitor Reads Returning No Data Due to Consistency Level?
Typically, when a read is submitted to C*, it may complete with …
1. No errors & returns the expected data
2. Errors out with UnavailableException
3. No error & returns zero rows on the first attempt, but the row is returned on subsequent runs
The third scenario happens as a result of cluster entropy, especially during unexpected outages affecting on-premise or cloud infrastructures. Typical scenario …
a) Multiple nodes fail in the cluster
b) A node is replaced via bootstrapping
c) The row is in Cassandra, but the client hits nodes that do not have the data yet and gets zero rows. The row is retrieved on the third or fourth attempt, and read repair takes care of it.
d) Eventually, repair is run and the issue is fixed.
Digging in Cassandra metrics, I’ve found ‘cassandra.unavailables.count’. It looks like this metric captures scenario 2 (UnavailableException). I have also read the Yelp article describing a metric they called ‘underreplicated keyspaces’. These are keyspace ranges that will fail to satisfy reads/writes at a certain CL due to insufficient endpoints. If my understanding is correct, this is also measuring scenario 2. I am trying to find a metric that captures scenario 3 above. Is this possible at all? Thank you
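One approach I can think of is a client-side canary; a hedged sketch, where the keyspace, table, and sentinel key are all hypothetical. The idea: write a sentinel row once at a strong CL, then poll it at the application's CL and count empty results as scenario-3 events.

# Poll a known-present sentinel row; an empty result at LOCAL_QUORUM is a
# scenario-3 event worth counting/alerting on.
OUT=$(cqlsh -e "CONSISTENCY LOCAL_QUORUM; SELECT id FROM my_ks.canary WHERE id = 1;")
if ! echo "$OUT" | grep -q '(1 rows)'; then
    logger -t cassandra-canary "empty read at LOCAL_QUORUM for sentinel row"
fi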
CL=LQ, RF=3: Can a Write Be Lost If Two Nodes ACKing It Die?
C*: 2.2.8
Write CL = LQ
Keyspace RF = 3
Three racks
A write is received by node 1 in rack 1 at the above specs. Node 1 (rack 1) & node 2 (rack 2) acknowledge it to the client. Within some unit of time, nodes 1 & 2 die. Either …
- Scenario 1: C* process death: the row did not make it to an sstable (it is in the commit log & was in the memtable)
- Scenario 2: Node death: the row may have made it to an sstable, but the nodes are gone (will have to bootstrap to replace them)
Scenario 1: the row is not lost, because once C* is restarted, the commit log should replay the mutation. Scenario 2: is the row gone forever? If these two nodes are replaced via bootstrapping, will they ever get the row back from node 3 (rack 3), if the write ever made it there? Thank you
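One caveat on scenario 1 worth checking (a hedged sketch; the config path assumes a package install): whether commit log replay actually covers an acknowledged write depends on the commit log sync mode.

# Inspect the sync mode; with the default 'periodic' mode and a 10s period,
# a process crash can lose up to ~10s of acknowledged-but-unsynced writes
# from that node's commit log:
grep -E 'commitlog_sync|commitlog_sync_period_in_ms|commitlog_sync_batch_window_in_ms' /etc/cassandra/cassandra.yaml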
RE: Bootstrapping to Replace a Dead Node vs. Adding a New Node: Consistency Guarantees
Appreciate your response. As for extending the cluster while keeping the default range movement = true, C* won’t allow me to bootstrap multiple nodes anyway. But the question I’m still posing, and have not gotten an answer for, is: if the fix in CASSANDRA-2434 disallows bootstrapping multiple nodes to extend the cluster (which I was able to verify in my lab cluster), why does it allow bootstrapping multiple nodes in the process of replacing dead nodes (where there is no range calculation)? The fix forces a node to bootstrap from the former owner of the range. Is this still the case when bootstrapping to replace a dead node? Thank you

From: ZAIDI, ASAD A
Sent: Wednesday, May 1, 2019 5:13 PM
To: user@cassandra.apache.org
Subject: RE: Bootstrapping to Replace a Dead Node vs. Adding a New Node: Consistency Guarantees

The article you mentioned here clearly says “For new users to Cassandra, the safest way to add multiple nodes into a cluster is to add them one at a time. Stay tuned as I will be following up with another post on bootstrapping.” When extending a cluster, it is indeed recommended to go slow & serially. Optionally you can use cassandra.consistent.rangemovement=false, but you can run into over-streamed data. Since you’re using a release much newer than the one the fix addresses, I assumed you won’t see the same behavior described for that version. After adding a node, if you don’t get consistent data, your query consistency level should be able to pull consistent data, given you can tolerate a bit of latency until your repair is complete. If you go by the recommendation, i.e. to add one node at a time, you’ll avoid all these nuances.

From: Fd Habash [mailto:fmhab...@gmail.com]
Sent: Wednesday, May 01, 2019 3:12 PM
To: user@cassandra.apache.org
Subject: RE: Bootstrapping to Replace a Dead Node vs. Adding a New Node: Consistency Guarantees

Probably, I needed to be clearer in my inquiry … I’m investigating a situation where our diagnostic data is telling us that C* has lost some of the application data. I mean, getsstables for the data returns zero on all nodes in all racks. The Last Pickle article below & Jeff Jirsa had described a situation where bootstrapping a node to extend the cluster can lose data if the new node bootstraps from a stale SECONDARY replica (a node that was offline longer than the hinted hand-off window). This was fixed in CASSANDRA-2434. http://thelastpickle.com/blog/2017/05/23/auto-bootstrapping-part1.html The article & the Jira above describe bootstrapping when extending a cluster. I understand replacing a dead node does not involve range movement, but will the above Jira fix prevent the bootstrap process, when replacing a dead node, from using a secondary replica? Thank you

From: Fred Habash
Sent: Wednesday, May 1, 2019 6:50 AM
To: user@cassandra.apache.org
Subject: Re: Bootstrapping to Replace a Dead Node vs. Adding a New Node: Consistency Guarantees

Thank you. Range movement is one reason this is enforced when adding a new node. But what about forcing a consistent bootstrap, i.e. bootstrapping from the primary owner of the range and not a secondary replica? How is consistent bootstrap enforced when replacing a dead node? Thank you.

On Apr 30, 2019, at 7:40 PM, Alok Dwivedi wrote:
When a new node joins the ring, it needs to own new token ranges. These should be unique to the new node, and we don’t want to end up in a situation where two nodes joining simultaneously can own the same range (ideally, ranges should also be evenly distributed). Cassandra has a 2-minute wait rule for gossip state to propagate before a node is added.
But this on its own does not guarantee that token ranges can’t overlap. See this ticket for more details: https://issues.apache.org/jira/browse/CASSANDRA-7069 To overcome this issue, the approach was to only allow one node to join at a time. When you replace a dead node, the new-token-range selection does not apply, as the replacing node just takes over the token ranges of the dead node. I think that’s why the restriction of only adding one node at a time does not apply in this case.
Thanks
Alok Dwivedi
Senior Consultant
https://www.instaclustr.com/platform/

From: Fd Habash
Reply-To: "user@cassandra.apache.org"
Date: Wednesday, 1 May 2019 at 06:18
To: "user@cassandra.apache.org"
Subject: Bootstrapping to Replace a Dead Node vs. Adding a New Node: Consistency Guarantees

Reviewing the documentation & based on my testing using C* 2.2.8, I was not able to extend the cluster by adding multiple nodes simultaneously. I got an error message …
Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
I understand this is to force a node to bootstrap from the former owner of the range when adding a node as part of extending the cluster. However, I was able to bootstrap multiple nodes to replace dead nodes. C* did not complain about it. Is consistent range movement & the guarantee it offers to bootstrap from the primary range owner not applicable when bootstrapping to replace dead nodes? Thank you
RE: Bootstrapping to Replace a Dead Node vs. Adding a New Node: Consistency Guarantees
Probably, I needed to be clearer in my inquiry … I’m investigating a situation where our diagnostic data is telling us that C* has lost some of the application data. I mean, getsstables for the data returns zero on all nodes in all racks. The Last Pickle article below & Jeff Jirsa had described a situation where bootstrapping a node to extend the cluster can lose data if the new node bootstraps from a stale SECONDARY replica (a node that was offline longer than the hinted hand-off window). This was fixed in CASSANDRA-2434. http://thelastpickle.com/blog/2017/05/23/auto-bootstrapping-part1.html The article & the Jira above describe bootstrapping when extending a cluster. I understand replacing a dead node does not involve range movement, but will the above Jira fix prevent the bootstrap process, when replacing a dead node, from using a secondary replica? Thank you

From: Fred Habash
Sent: Wednesday, May 1, 2019 6:50 AM
To: user@cassandra.apache.org
Subject: Re: Bootstrapping to Replace a Dead Node vs. Adding a New Node: Consistency Guarantees

Thank you. Range movement is one reason this is enforced when adding a new node. But what about forcing a consistent bootstrap, i.e. bootstrapping from the primary owner of the range and not a secondary replica? How is consistent bootstrap enforced when replacing a dead node? Thank you.

On Apr 30, 2019, at 7:40 PM, Alok Dwivedi wrote:
When a new node joins the ring, it needs to own new token ranges. These should be unique to the new node, and we don’t want to end up in a situation where two nodes joining simultaneously can own the same range (ideally, ranges should also be evenly distributed). Cassandra has a 2-minute wait rule for gossip state to propagate before a node is added. But this on its own does not guarantee that token ranges can’t overlap. See this ticket for more details: https://issues.apache.org/jira/browse/CASSANDRA-7069 To overcome this issue, the approach was to only allow one node to join at a time. When you replace a dead node, the new-token-range selection does not apply, as the replacing node just takes over the token ranges of the dead node. I think that’s why the restriction of only adding one node at a time does not apply in this case.
Thanks
Alok Dwivedi
Senior Consultant
https://www.instaclustr.com/platform/

From: Fd Habash
Reply-To: "user@cassandra.apache.org"
Date: Wednesday, 1 May 2019 at 06:18
To: "user@cassandra.apache.org"
Subject: Bootstrapping to Replace a Dead Node vs. Adding a New Node: Consistency Guarantees

Reviewing the documentation & based on my testing using C* 2.2.8, I was not able to extend the cluster by adding multiple nodes simultaneously. I got an error message …
Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
I understand this is to force a node to bootstrap from the former owner of the range when adding a node as part of extending the cluster. However, I was able to bootstrap multiple nodes to replace dead nodes. C* did not complain about it. Is consistent range movement & the guarantee it offers to bootstrap from the primary range owner not applicable when bootstrapping to replace dead nodes? Thank you
Bootstrapping to Replace a Dead Node vs. Adding a New Node: Consistency Guarantees
Reviewing the documentation & based on my testing using C* 2.2.8, I was not able to extend the cluster by adding multiple nodes simultaneously. I got an error message …
Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
I understand this is to force a node to bootstrap from the former owner of the range when adding a node as part of extending the cluster. However, I was able to bootstrap multiple nodes to replace dead nodes. C* did not complain about it. Is consistent range movement & the guarantee it offers to bootstrap from the primary range owner not applicable when bootstrapping to replace dead nodes? Thank you
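A hedged illustration of the two startup paths being compared in this thread (the IP is a placeholder; these are JVM flags typically set via cassandra-env.sh):

# Extending the cluster: consistent range movement is enforced by default,
# which is what produces the "cannot bootstrap while
# cassandra.consistent.rangemovement is true" error when nodes join together:
JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=true"

# Replacing a dead node: no new token selection; the replacement simply
# inherits the dead node's tokens and streams them from surviving replicas:
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.42"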
RE: A keyspace with RF=3, Cluster with 3 RACs, CL=LQ: No Data on First Attempt, but 1 Row Afterwards
Any ideas, please? Thank you

From: Fd Habash
Sent: Tuesday, April 23, 2019 10:38 AM
To: user@cassandra.apache.org
Subject: A keyspace with RF=3, Cluster with 3 RACs, CL=LQ: No Data on First Attempt, but 1 Row Afterwards

Cluster setup …
- C* 2.2.8
- Three RACs, one DC
- Keyspace with RF=3
- RS = network topology
At CL=LQ, I get zero rows on the first attempt, and one row on the second or third. Once found, I always get the row afterwards. Trying to understand this behavior … On the first attempt, my read request hits a RAC that simply does not have the data. Subsequent attempts hit another RAC that has it, which triggers a read repair, causing the row to be returned consistently afterwards. Is this correct? If a coordinator picks a node in the same RAC and the node does not have the data on disk, is it going to stop there and return nothing, even though the row does exist on another RAC? If anti-entropy repair (‘repair -pr’) has completed successfully on the entire cluster, why is this still happening? Thank you
A keyspace with RF=3, Cluster with 3 RACs, CL=LQ: No Data on First Attempt, but 1 Row Afterwards
Cluster setup …
- C* 2.2.8
- Three RACs, one DC
- Keyspace with RF=3
- RS = network topology
At CL=LQ, I get zero rows on the first attempt, and one row on the second or third. Once found, I always get the row afterwards. Trying to understand this behavior … On the first attempt, my read request hits a RAC that simply does not have the data. Subsequent attempts hit another RAC that has it, which triggers a read repair, causing the row to be returned consistently afterwards. Is this correct? If a coordinator picks a node in the same RAC and the node does not have the data on disk, is it going to stop there and return nothing, even though the row does exist on another RAC? If anti-entropy repair (‘repair -pr’) has completed successfully on the entire cluster, why is this still happening? Thank you
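One way to see exactly which replicas a given read touched (a hedged sketch; the table and key are hypothetical) is query tracing in cqlsh:

# The trace lists every replica contacted, any digest mismatch, and any
# read repair that fired as a result:
cqlsh -e "CONSISTENCY LOCAL_QUORUM; TRACING ON; SELECT * FROM my_ks.my_table WHERE pk = 'some-key';"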
RE: Cannot replace_address /10.xx.xx.xx because it doesn't exist in gossip
I can conclusively say none of these commands were run. However, I think this is the likely scenario … If you have a cluster of three nodes 1, 2, 3 …
- Node 3 shows as DN
- Restart C* on 1 & 2
- Nodetool status should NOT show node 3’s IP at all
Restarting the cluster while a node is down resets gossip state. There is a good chance this is what happened. Plausible? Thank you

From: Jeff Jirsa
Sent: Thursday, March 14, 2019 11:06 AM
To: cassandra
Subject: Re: Cannot replace_address /10.xx.xx.xx because it doesn't exist in gossip

Two things that wouldn't be a bug:
You could have run removenode
You could have run assassinate
Also could be some new bug, but that's much less likely.

On Thu, Mar 14, 2019 at 2:50 PM Fd Habash wrote:
I have a node which I know for certain was a cluster member last week. It showed in nodetool status as DN. When I attempted to replace it today, I got this message …
ERROR [main] 2019-03-14 14:40:49,208 CassandraDaemon.java:654 - Exception encountered during startup
java.lang.RuntimeException: Cannot replace_address /10.xx.xx.xxx.xx because it doesn't exist in gossip
at org.apache.cassandra.service.StorageService.prepareReplacementInfo(StorageService.java:449) ~[apache-cassandra-2.2.8.jar:2.2.8]
DN 10.xx.xx.xx 388.43 KB 256 6.9% bdbd632a-bf5d-44d4-b220-f17f258c4701 1e
Under what conditions does this happen? Thank you
Cannot replace_address /10.xx.xx.xx because it doesn't exist in gossip
I have a node which I know for certain was a cluster member last week. It showed in nodetool status as DN. When I attempted to replace it today, I got this message …
ERROR [main] 2019-03-14 14:40:49,208 CassandraDaemon.java:654 - Exception encountered during startup
java.lang.RuntimeException: Cannot replace_address /10.xx.xx.xxx.xx because it doesn't exist in gossip
at org.apache.cassandra.service.StorageService.prepareReplacementInfo(StorageService.java:449) ~[apache-cassandra-2.2.8.jar:2.2.8]
DN 10.xx.xx.xx 388.43 KB 256 6.9% bdbd632a-bf5d-44d4-b220-f17f258c4701 1e
Under what conditions does this happen? Thank you
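Before retrying the replace, gossip visibility of the dead endpoint can be checked from any live node; a hedged sketch, where the IP is a placeholder:

# Is the endpoint still known to gossip?
nodetool gossipinfo | grep -A 5 '10.20.30.40'
nodetool status | grep '10.20.30.40'
# If it is absent from gossip entirely, replace_address cannot work; the
# usual fallback is to bootstrap a fresh node normally and then run
# 'nodetool removenode <host-id>' for the dead one.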
Loss of an Entire AZ in a Three-AZ Cassandra Cluster
Assume you have a 30-node cluster distributed across three AZs with an RF of 3. Trying to come up with a runbook to manage multi-node failures as a result of …
- Loss of an entire AZ1
- Loss of multiple nodes in AZ2
- AZ3 unaffected; no node loss
Is this the most optimal plan for replacing dead nodes via bootstrapping (see the sketch below)?
1. Replace seed nodes first (via bootstrap)
2. Bootstrap the few nodes in AZ2
3. Bootstrap all nodes in AZ1
4. Run a cluster repair
Do you wait to bootstrap everything before running repair, or do you repair per node? Did I miss anything? Thank you
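A hedged runbook sketch of the sequence above (the IP and service paths are placeholders; the seed-list handling is an assumption worth validating for your version):

# For each dead node, start its replacement with (in cassandra-env.sh):
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<dead-node-ip>"
# A replacement for a seed must first be removed from its own seed list,
# since seed nodes do not bootstrap.
# Replace one node at a time, confirming all nodes are UN before the next:
nodetool status | awk '$1 == "UN"' | wc -l
# Only after all replacements are UN, run the repair, serially per node:
nodetool repair -full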
Migrating to Reaper: Switching From Incremental to Reaper's Full Subrange Repair
For those who are using Reaper … Currently, I run repairs from crontab via nodetool using ‘repair -pr’ on 2.2.8, which defaults to incremental. If I migrate to Reaper, do I have to mark sstables as un-repaired first? Also, out of the box, does Reaper run full parallel repair? If yes, is it not going to cause over-streaming, since we would be repairing ranges multiple times? Thank you
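On the first question, the repaired state can be cleared per sstable with the bundled sstablerepairedset tool; a hedged sketch (paths assume a typical package layout, the keyspace name is a placeholder, and the node must be stopped while it runs):

nodetool drain && sudo service cassandra stop
find /var/lib/cassandra/data/my_ks -name '*-Data.db' > /tmp/sstables.txt
sstablerepairedset --really-set --is-unrepaired -f /tmp/sstables.txt
sudo service cassandra start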
RE: Read Latency Doubles After Shrinking Cluster and Never Recovers
I will check for both. On a different subject, I have read some user testimonies that running ‘nodetool cleanup’ requires a C* process reboot, at least around 2.2.8. Is this true? Thank you

From: Nitan Kainth
Sent: Monday, June 11, 2018 10:40 AM
To: user@cassandra.apache.org
Subject: Re: Read Latency Doubles After Shrinking Cluster and Never Recovers

I think it would, because Cassandra will process more sstables to create the response to read queries. Now, after cleanup, if the data volume is the same and compaction has been running, I can’t think of any more diagnostic steps. Let’s wait for other experts to comment. Can you also check the sstable count for each table, just to be sure they are not extraordinarily high?
Sent from my iPhone

On Jun 11, 2018, at 10:21 AM, Fd Habash wrote:
Yes, we did, after adding the three nodes back, and a full cluster repair as well. But even if we didn’t run cleanup, would it have impacted read latency, given that some nodes still have sstables they no longer need? Thank you

From: Nitan Kainth
Sent: Monday, June 11, 2018 10:18 AM
To: user@cassandra.apache.org
Subject: Re: Read Latency Doubles After Shrinking Cluster and Never Recovers

Did you run cleanup too?

On Mon, Jun 11, 2018 at 10:16 AM, Fred Habash wrote:
I have hit dead-ends everywhere I turned on this issue. We had a 15-node cluster that was doing 35 ms all along, for years. At some point, we made a decision to shrink it to 12. Read latency rose to near 70 ms. Shortly after, we decided this was not acceptable, so we added the three nodes back in. Read latency dropped to near 50 ms, and it has been hovering around that value for over 6 months now. Repairs run regularly, load on the cluster nodes is even, and the application activity profile has not changed. Why are we unable to get back the same read latency now that the cluster is 15 nodes large, the same as it was before? -- Thank you
RE: Read Latency Doubles After Shrinking Cluster and Never Recovers
Yes, we did, after adding the three nodes back, and a full cluster repair as well. But even if we didn’t run cleanup, would it have impacted read latency, given that some nodes still have sstables they no longer need? Thank you

From: Nitan Kainth
Sent: Monday, June 11, 2018 10:18 AM
To: user@cassandra.apache.org
Subject: Re: Read Latency Doubles After Shrinking Cluster and Never Recovers

Did you run cleanup too?

On Mon, Jun 11, 2018 at 10:16 AM, Fred Habash wrote:
I have hit dead-ends everywhere I turned on this issue. We had a 15-node cluster that was doing 35 ms all along, for years. At some point, we made a decision to shrink it to 12. Read latency rose to near 70 ms. Shortly after, we decided this was not acceptable, so we added the three nodes back in. Read latency dropped to near 50 ms, and it has been hovering around that value for over 6 months now. Repairs run regularly, load on the cluster nodes is even, and the application activity profile has not changed. Why are we unable to get back the same read latency now that the cluster is 15 nodes large, the same as it was before? -- Thank you
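For what it's worth, cleanup is an online, per-node operation and should not require a process restart; a hedged sketch, with the keyspace name as a placeholder:

# Drops data for ranges the node no longer owns; run per node after
# topology changes, keyspace by keyspace to bound the compaction load:
nodetool cleanup my_ks
# Then compare per-table sstable counts and live disk usage:
nodetool cfstats my_ks | grep -E 'Table:|SSTable count|Space used'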
RE: Cassandra upgrade from 2.2.8 to 3.10
Thank you. In regards to my second inquiry: as we plan for C* upgrades, I have not found NEWS.txt to always be telling of possible upgrade paths. Is there a rule of thumb, or maybe an official reference, for upgrade paths? Thank you

From: Alexander Dejanovski
Sent: Wednesday, March 28, 2018 1:58 PM
To: user@cassandra.apache.org
Subject: Re: Cassandra upgrade from 2.2.8 to 3.10

You can perform an upgrade from 2.2.x straight to 3.11.2, but the OP suggests adding nodes in 3.10 to a cluster that runs 2.2.8, which is why Jeff says it won't work. I see no reason to upgrade to 3.10 and not 3.11.2, by the way.

On Wed, Mar 28, 2018 at 5:10 PM Fred Habash wrote:
Hi ... I'm finding anecdotal evidence on the internet that we are able to upgrade 2.2.8 to the latest 3.11.2. The post below indicates that you can upgrade to the latest 3.x from 2.1.9, because 3.x no longer requires a 'structured upgrade path'. I just want to confirm that such an upgrade is supported. If yes, where can I find official documentation showing the upgrade path across releases? https://stackoverflow.com/questions/42094935/apache-cassandra-upgrade-3-x-from-2-1 Thanks

On Mon, Aug 7, 2017 at 5:58 PM, ZAIDI, ASAD A wrote:
Hi folks, I have a question on the upgrade method I’m thinking to execute. I’m planning to upgrade from Apache Cassandra 2.2.8 to release 3.10. My Cassandra cluster is configured as one rack with two datacenters:
1. DC1 has 4 nodes
2. DC2 has 16 nodes
We’re adding another 12 nodes and would eventually need to remove those 4 nodes in DC1. I’m thinking to add a third data center, DC3, with 12 nodes having Apache Cassandra 3.10 installed. Then I start upgrading seed nodes first in DC1 & DC2 - once all 20 nodes in (DC1 plus DC2) are upgraded, I can safely remove the 4 DC1 nodes. Can you guys please let me know if this approach would work? I’m concerned that having mixed versions on Cassandra nodes may cause issues, like in streaming data/sstables from the existing DCs to the newly created third DC with version 3.10 installed. Will nodes in DC3 join the cluster with data without issues? Thanks/Asad
--
Thank you ...
Fred Habash, Database Solutions Architect (Oracle OCP 8i,9i,10g,11g)
--
-
Alexander Dejanovski
France
@alexanderdeja
Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
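Whichever hop sequence is chosen, each hop typically ends with a per-node sstable rewrite; a hedged sketch of the per-node mechanics (the service commands are placeholders for your init system):

nodetool drain                 # flush memtables and stop accepting writes
sudo service cassandra stop    # swap binaries/packages to the target version
sudo service cassandra start
nodetool upgradesstables       # rewrite sstables into the new on-disk format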
RE: On a 12-node Cluster, Starting C* on a Seed Node Increases Read Latency from 150ms to 1.5 sec.
I understand you use Apache Cassandra 2.2.8. :)
- Yes. It was a typo.
In Apache Cassandra 2.2.8, this triggers incremental repairs I believe
- Yes, the default as of 2.2, and using primary-range repairs, which run on every node in the cluster.
Did you replace the node in-place?
- Yes. We removed it from its seed provider list. Otherwise, it won’t bootstrap.
You should be able to have nodes going down, or being fairly slow …
- When we stopped C* on this node, read performance recovered well. Once it was started, and now with no repairs running at all, latency increased again to over 1.5 secs. This affected the node (in AZ 1) and the other 8 nodes (4 in AZ 2 and 4 in AZ 3). That is, it slowed down the other 2 AZs.
- The application reads with CL=LQ.
- This behavior I do not understand. There is no streaming.
My coworker Alexander wrote about this a few months ago
- We have been looking into Reaper for the past 2 months. Work in progress. And thank you for the thorough response.

From: Alain RODRIGUEZ
Sent: Friday, March 2, 2018 11:43 AM
To: user@cassandra.apache.org
Subject: Re: On a 12-node Cluster, Starting C* on a Seed Node Increases Read Latency from 150ms to 1.5 sec.

Hello,
This is a 2.8.8. cluster
That's an exotic version! I understand you use Apache Cassandra 2.2.8. :)
This single node was a seed node and it was running a ‘repair -pr’ at the time
In Apache Cassandra 2.2.8, this triggers incremental repairs I believe, and they are relatively (some would say completely) broken. Let's say they caused a lot of trouble in many cases. If I am wrong and you are not running incremental repairs (the default in your version, off the top of my head), then your node might not have enough resources available to handle both the repair and the standard load. It might be something to check. Consequences of incremental repairs are:
- Keeping SSTables split between repaired and not-repaired sets, increasing the number of SSTables
- Anti-compaction (which splits SSTables) is used to keep them grouped
This induces a lot of performance downsides such as (but not only):
- Inefficient tombstone eviction
- More disk hits for the same queries
- More compaction work
Machines are then performing very poorly. My coworker Alexander wrote about this a few months ago; it might be of interest: http://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html If repairs are a pain point, you might be interested in checking http://cassandra-reaper.io/, which aims at making this operation easier and more efficient. I would say the fact this node is a seed node did not matter here; it is a coincidence due to the fact you picked a seed for the repair. Seed nodes mostly work like any other node, except during bootstrap.
So we decided to bootstrap it.
I am not sure what happens when bootstrapping a seed node. I always removed it from the seed list first. Did you replace the node in-place? I guess if you had no warning and have no consistency issues, it's all good.
All we were able to see is that the seed node in question was different in that it had 5000 sstables while all others had around 2300. After bootstrap, seed node sstables reduced to 2500.
I would say this is fairly common (even more when using vnodes), as streaming of the data from all the other nodes is fast and compaction might take a while to catch up.
Why would starting C* on a single seed node affect the cluster this bad?
That's a fair question. It depends on factors such as the client configuration, the replication factor, and the consistency level used.
If the node is involved in some reads, then the average latency will rise. You should be able to have nodes going down, or being fairly slow, and still use the right nodes if the client is recent enough and well configured.
Is it gossip?
It might be; there were issues, but I believe in previous versions and/or on bigger clusters. I would dig for a 'repair' issue first; it seems more probable to me. I hope this helped,
C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain
The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-03-02 14:42 GMT+00:00 Fd Habash <fmhab...@gmail.com>:
This is a 2.8.8. cluster with three AWS AZs, each with 4 nodes. A few days ago we noticed a single node’s read latency reaching 1.5 secs; there were 8 others with read latencies going up near 900 ms. This single node was a seed node and it was running a ‘repair -pr’ at the time. We intervened as follows …
• Stopping compactions during repair did not improve latency.
• Killing repair brought down latency to 200 ms on the seed node and the other 8.
• Restarting C* on the seed node increased latency again back to near 1.5 secs on the seed and the other 8. At this point, there was no repair running and compactions were running. We left them alone.
At this point, we saw that putting the seed node back in the cluster consistently worsened latencies on the seed and 8 nodes = 9 out of the 12 nodes in the cluster.
On a 12-node Cluster, Starting C* on a Seed Node Increases Read Latency from 150ms to 1.5 sec.
This is a 2.8.8. cluster with three AWS AZs, each with 4 nodes. A few days ago we noticed a single node’s read latency reaching 1.5 secs; there were 8 others with read latencies going up near 900 ms. This single node was a seed node and it was running a ‘repair -pr’ at the time. We intervened as follows …
• Stopping compactions during repair did not improve latency.
• Killing repair brought down latency to 200 ms on the seed node and the other 8.
• Restarting C* on the seed node increased latency again back to near 1.5 secs on the seed and the other 8. At this point, there was no repair running and compactions were running. We left them alone.
At this point, we saw that putting the seed node back in the cluster consistently worsened latencies on the seed and 8 nodes = 9 out of the 12 nodes in the cluster. So we decided to bootstrap it. During the bootstrapping and afterwards, latencies remained near 200 ms, which is what we wanted for now. All we were able to see is that the seed node in question was different in that it had 5000 sstables while all others had around 2300. After bootstrap, the seed node’s sstables reduced to 2500. Why would starting C* on a single seed node affect the cluster this badly? Again, no repair, just the 4 compactions that run routinely on it as well as on all the others. Is it gossip? What other plausible explanations are there? Thank you
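A few read-only checks that would surface the sstable-count skew described above (a hedged sketch; keyspace/table names are placeholders):

nodetool cfstats my_ks.my_table | grep 'SSTable count'  # per-table sstable counts
nodetool compactionstats    # pending compactions while the node catches up
nodetool netstats           # confirms whether any streaming is in flight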
RE: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster
One more observation … When we compare read latencies between the non-prod cluster (where nodes were removed) and the prod cluster, even though the node load, as measured by the size of the /data dir, is similar, the read latencies are 5 times slower in the downsized non-prod cluster. The only difference we see is that prod reads from 4 sstables whereas non-prod reads from 5, per cfhistograms.

Non-prod /data size -
Filesystem    Size  Used  Avail  Use%  Mounted on
/dev/nvme0n1  885G  454G  432G    52%  /data
/dev/nvme0n1  885G  439G  446G    50%  /data
/dev/nvme0n1  885G  368G  518G    42%  /data
/dev/nvme0n1  885G  431G  455G    49%  /data
/dev/nvme0n1  885G  463G  423G    53%  /data
/dev/nvme0n1  885G  406G  479G    46%  /data
/dev/nvme0n1  885G  419G  466G    48%  /data

Prod /data size -
Filesystem    Size  Used  Avail  Use%  Mounted on
/dev/nvme0n1  885G  352G  534G    40%  /data
/dev/nvme0n1  885G  423G  462G    48%  /data
/dev/nvme0n1  885G  431G  454G    49%  /data
/dev/nvme0n1  885G  442G  443G    50%  /data
/dev/nvme0n1  885G  454G  431G    52%  /data

Cfhistograms: comparing prod to non-prod -

Non-prod --
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                           (micros)      (micros)         (bytes)
50%             1.00          24.60       4055.27           11864           4
75%             2.00          35.43      14530.76           17084           4
95%             4.00         126.93      89970.66           35425           4
98%             5.00         219.34     155469.30           73457           4
99%             5.00         219.34     186563.16          105778           4
Min             0.00           5.72         17.09              87           3
Max             7.00       20924.30    1386179.89        14530764           4

Prod ---
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                           (micros)      (micros)         (bytes)
50%             1.00          24.60       2346.80           11864           4
75%             2.00          29.52       4866.32           17084           4
95%             3.00          73.46      14530.76           29521           4
98%             4.00         182.79      25109.16           61214           4
99%             4.00         182.79      36157.19           88148           4
Min             0.00           9.89         20.50              87           0
Max             5.00         219.34     155469.30        12108970           4

Thank you

From: Fd Habash
Sent: Thursday, February 22, 2018 9:00 AM
To: user@cassandra.apache.org
Subject: RE: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster

“data was allowed to fully rebalance/repair/drain before the next node was taken off?”
-- Judging by the messages, the decommission was healthy. As an example:
StorageService.java:3425 - Announcing that I have left the ring for 3ms
…
INFO [RMI TCP Connection(4)-127.0.0.1] 2016-01-07 06:00:52,662 StorageService.java:1191 – DECOMMISSIONED
I do not believe repairs were run after each node removal. I’ll double-check. I’m not sure what you mean by ‘rebalance’? How do you check if a node is balanced? Load/size of the data dir? As for the drain, there was no need to drain, and I believe it is not something you do as part of decommissioning a node.
“did you take 1 off per rack/AZ?”
-- We removed 3 nodes, one from each AZ, in sequence. These are some of the cfhistogram metrics; read latencies are high after the removal of the nodes.
RE: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster
“data was allowed to fully rebalance/repair/drain before the next node was taken off?”
-- Judging by the messages, the decommission was healthy. As an example:
StorageService.java:3425 - Announcing that I have left the ring for 3ms
…
INFO [RMI TCP Connection(4)-127.0.0.1] 2016-01-07 06:00:52,662 StorageService.java:1191 – DECOMMISSIONED
I do not believe repairs were run after each node removal. I’ll double-check. I’m not sure what you mean by ‘rebalance’? How do you check if a node is balanced? Load/size of the data dir? As for the drain, there was no need to drain, and I believe it is not something you do as part of decommissioning a node.
“did you take 1 off per rack/AZ?”
-- We removed 3 nodes, one from each AZ, in sequence.
These are some of the cfhistogram metrics. Read latencies are high after the removal of the nodes -- you can see reads of 186 ms at the 99th percentile from 5 sstables. These are awfully high numbers given that these metrics measure C* storage-layer read performance. Does this mean removing the nodes undersized the cluster?

key_space_01/cf_01 histograms
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                           (micros)      (micros)         (bytes)
50%             1.00          24.60       4055.27           11864           4
75%             2.00          35.43      14530.76           17084           4
95%             4.00         126.93      89970.66           35425           4
98%             5.00         219.34     155469.30           73457           4
99%             5.00         219.34     186563.16          105778           4
Min             0.00           5.72         17.09              87           3
Max             7.00       20924.30    1386179.89        14530764           4

key_space_01/cf_01 histograms
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                           (micros)      (micros)         (bytes)
50%             1.00          29.52       4055.27           11864           4
75%             2.00          42.51      10090.81           17084           4
95%             4.00         152.32      52066.35           35425           4
98%             4.00         219.34      89970.66           73457           4
99%             5.00         219.34     155469.30           88148           4
Min             0.00           9.89         24.60              87           0
Max             6.00        1955.67     557074.61        14530764           4

Thank you

From: Carl Mueller
Sent: Wednesday, February 21, 2018 4:33 PM
To: user@cassandra.apache.org
Subject: Re: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster

Hm, nodetool decommission performs the stream-out of the replicated data, and you said that was apparently without error... But if you dropped three nodes in one AZ/rack of five nodes with RF3, then we have a missing RF factor, unless NetworkTopologyStrategy fails over to another AZ. But that would also entail cross-AZ streaming and queries and repair.

On Wed, Feb 21, 2018 at 3:30 PM, Carl Mueller <carl.muel...@smartthings.com> wrote:
sorry for the idiot questions... data was allowed to fully rebalance/repair/drain before the next node was taken off? did you take 1 off per rack/AZ?

On Wed, Feb 21, 2018 at 12:29 PM, Fred Habash <fmhab...@gmail.com> wrote:
One node at a time

On Feb 21, 2018 10:23 AM, "Carl Mueller" <carl.muel...@smartthings.com> wrote:
What is your replication factor? Single datacenter, three availability zones, is that right? You removed one node at a time or three at once?

On Wed, Feb 21, 2018 at 10:20 AM, Fd Habash <fmhab...@gmail.com> wrote:
We have had a 15-node cluster across three zones, and cluster repairs using ‘nodetool repair -pr’ took about 3 hours to finish. Lately, we shrunk the cluster to 12. Since then, the same repair job has taken up to 12 hours to finish, and most times it never does. More importantly, at some point during the repair cycle, we see read latencies jumping to 1-2 seconds and applications immediately notice the impact. stream_throughput_outbound_megabits_per_sec is set at 200 and compaction_throughput_mb_per_sec at 64. The /data dir on the nodes is around ~500GB at 44% usage.
When shrinking the cluster, the ‘nodetool decommission’ was uneventful. It completed successfully with no issues. What could possibly cause repairs to cause this impact following cluster downsizing? Taking three nodes out does not seem compatible with such a drastic effect on repair and read latency. Any expert insights will be appreciated. Thank you
Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster
We have had a 15-node cluster across three zones, and cluster repairs using ‘nodetool repair -pr’ took about 3 hours to finish. Lately, we shrunk the cluster to 12. Since then, the same repair job has taken up to 12 hours to finish, and most times it never does. More importantly, at some point during the repair cycle, we see read latencies jumping to 1-2 seconds and applications immediately notice the impact. stream_throughput_outbound_megabits_per_sec is set at 200 and compaction_throughput_mb_per_sec at 64. The /data dir on the nodes is around ~500GB at 44% usage. When shrinking the cluster, the ‘nodetool decommission’ was uneventful. It completed successfully with no issues. What could possibly cause repairs to cause this impact following cluster downsizing? Taking three nodes out does not seem compatible with such a drastic effect on repair and read latency. Any expert insights will be appreciated. Thank you
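The two throttles cited above can be read back (and adjusted live) while diagnosing; a hedged sketch, with placeholder keyspace/table names in the scoped repair:

# Current live values of the two settings mentioned above:
nodetool getstreamthroughput
nodetool getcompactionthroughput
# Repair load can also be bounded by scoping to one keyspace/table at a
# time, still running -pr on every node:
nodetool repair -pr -full my_ks my_table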
RE: When Replacing a Node, How to Force a Consistent Bootstrap
“… but it's better to repair before and after if possible …”
After, I simply run ‘nodetool repair -full’ on the replaced node. But before bootstrapping, if my cluster is distributed over 3 AZs, what do I repair? The entirety of the other AZs? As one pointed out earlier, I can use ‘nodetool repair -hosts’, but how do you identify which specific hosts to repair? Thank you

From: Fd Habash
Sent: Thursday, December 7, 2017 12:09 PM
To: user@cassandra.apache.org
Subject: RE: When Replacing a Node, How to Force a Consistent Bootstrap

Thank you. How do I identify which other 2 nodes the former downed node replicated with? Take a replica set of 3 nodes A, B, C. Now, C has been terminated by AWS and is gone. Using getendpoints assumes knowing a partition key value, but how do you even know what key to use? If there is a way to identify A and B, I can then simply run ‘nodetool repair’ to repair ALL the ranges on either. Thank you

From: kurt greaves
Sent: Wednesday, December 6, 2017 6:45 PM
To: User
Subject: Re: When Replacing a Node, How to Force a Consistent Bootstrap

That's also an option, but it's better to repair before and after if possible. If you don't repair beforehand, you could end up missing some replicas until you repair after replacement, which could cause queries to return old/no data. Alternatively, you could use ALL after replacing until the repair completes. For example, A and C have replica a; A dies; on replace, A streams the partition owning a from B, and thus is still inconsistent. A QUORUM query hits A and B, and no results are returned for a.

On 5 December 2017 at 23:04, Fred Habash <fmhab...@gmail.com> wrote:
Or, do a full repair after bootstrapping completes?

On Dec 5, 2017 4:43 PM, "Jeff Jirsa" <jji...@gmail.com> wrote:
You can't ask Cassandra to stream from the node with the "most recent data", because for some rows B may be most recent, and for others C may be most recent - you'd have to stream from both (which we don't support). You'll need to repair (and you can repair before you do the replace to avoid the window of time where you violate consistency - use the -hosts option to allow repair with a down host; you'll repair A+C, so when B starts it'll definitely have all of the data).

On Tue, Dec 5, 2017 at 1:38 PM, Fd Habash <fmhab...@gmail.com> wrote:
Assume I have a cluster of 3 nodes (A, B, C). Row x was written with CL=LQ to nodes A and B. Before it was written to C, node B crashed. I replaced B and it bootstrapped data from node C. Now, row x is missing from C and B. If node A crashes, it will be replaced and it will bootstrap from either C or B. As such, row x is now completely gone from the entire ring. Is this scenario possible at all (at least in C* < 3.0)? How can a newly replaced node be forced to bootstrap from the node in the replica set that has the most recent data? Otherwise, we have to repair a node immediately after bootstrapping it for a node replacement. Thank you
RE: When Replacing a Node, How to Force a Consistent Bootstrap
Thank you. How do I identify which other 2 nodes the former downed node replicated with? Take a replica set of 3 nodes A, B, C. Now, C has been terminated by AWS and is gone. Using getendpoints assumes knowing a partition key value, but how do you even know what key to use? If there is a way to identify A and B, I can then simply run ‘nodetool repair’ to repair ALL the ranges on either. Thank you

From: kurt greaves
Sent: Wednesday, December 6, 2017 6:45 PM
To: User
Subject: Re: When Replacing a Node, How to Force a Consistent Bootstrap

That's also an option, but it's better to repair before and after if possible. If you don't repair beforehand, you could end up missing some replicas until you repair after replacement, which could cause queries to return old/no data. Alternatively, you could use ALL after replacing until the repair completes. For example, A and C have replica a; A dies; on replace, A streams the partition owning a from B, and thus is still inconsistent. A QUORUM query hits A and B, and no results are returned for a.

On 5 December 2017 at 23:04, Fred Habash <fmhab...@gmail.com> wrote:
Or, do a full repair after bootstrapping completes?

On Dec 5, 2017 4:43 PM, "Jeff Jirsa" <jji...@gmail.com> wrote:
You can't ask Cassandra to stream from the node with the "most recent data", because for some rows B may be most recent, and for others C may be most recent - you'd have to stream from both (which we don't support). You'll need to repair (and you can repair before you do the replace to avoid the window of time where you violate consistency - use the -hosts option to allow repair with a down host; you'll repair A+C, so when B starts it'll definitely have all of the data).

On Tue, Dec 5, 2017 at 1:38 PM, Fd Habash <fmhab...@gmail.com> wrote:
Assume I have a cluster of 3 nodes (A, B, C). Row x was written with CL=LQ to nodes A and B. Before it was written to C, node B crashed. I replaced B and it bootstrapped data from node C. Now, row x is missing from C and B. If node A crashes, it will be replaced and it will bootstrap from either C or B. As such, row x is now completely gone from the entire ring. Is this scenario possible at all (at least in C* < 3.0)? How can a newly replaced node be forced to bootstrap from the node in the replica set that has the most recent data? Otherwise, we have to repair a node immediately after bootstrapping it for a node replacement. Thank you
When Replacing a Node, How to Force a Consistent Bootstrap
Assume I have a cluster of 3 nodes (A, B, C). Row x was written with CL=LQ to nodes A and B. Before it was written to C, node B crashed. I replaced B and it bootstrapped data from node C. Now, row x is missing from C and B. If node A crashes, it will be replaced and it will bootstrap from either C or B. As such, row x is now completely gone from the entire ring. Is this scenario possible at all (at least in C* < 3.0)? How can a newly replaced node be forced to bootstrap from the node in the replica set that has the most recent data? Otherwise, we have to repair a node immediately after bootstrapping it for a node replacement. Thank you
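One way to enumerate a replica set without knowing an application key, plus the pre-replace repair suggested in this thread; a hedged sketch (names and IPs are placeholders, and the exact -hosts flag syntax may vary by version):

# Token ranges and their owners for the keyspace:
nodetool ring my_ks
# Or, for a specific partition key if one is known:
nodetool getendpoints my_ks my_table 'some-key'
# Per the suggestion above, repair the survivors against each other before
# the replacement starts, tolerating the down host:
nodetool repair -full -hosts <ip-of-A> -hosts <ip-of-C> my_ks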
Replacing a Seed Node
Hi all … I know there are plenty of docs on how to replace a seed node, but some steps are contradictory, e.g. the need to remove the node from the seed list for the entire cluster. My cluster has 6 nodes with 3 seeds, running C* 2.8. One seed node was terminated by AWS. I came up with this procedure. Did I miss anything (see the sketch below) …
1) Remove the node (decommission or removenode) based on its current status
2) Remove the node from its own seed list
a. No need to remove it from other nodes. My cluster has 3 seeds
3) Restart C* with auto_bootstrap = true
4) Once auto-bootstrap is done, re-add the node as a seed in its own cassandra.yaml again
5) Restart C* on this node
6) No need to restart the other nodes in the cluster
Thank you
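Step 2 in concrete terms (a hedged sketch; the IPs are placeholders). On the replacement node only, its own IP is left out of its seed list so auto_bootstrap is honored:

# cassandra.yaml on the new node, before its first start:
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.0.0.1,10.0.0.2"   # the two surviving seeds only
# After bootstrap completes, add this node's own IP back to 'seeds' here
# and restart C* on this node only.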
Sync Spark Data with Cassandra Using Incremental Data Loading
I have a scenario where data has to be loaded into Spark nodes from two data stores: Oracle and Cassandra. We did the initial loading of data and found a way to do daily incremental loading from Oracle to Spark. I’m trying to figure out how to do this from C*. What tools are available in C* to do incremental backup/restore/load? Thanks
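On the C* side, the closest native building block is incremental backups, which hard-link each newly flushed sstable into a per-table backups/ directory; a hedged sketch, with paths and names as placeholders:

# Enable at runtime (or set incremental_backups: true in cassandra.yaml):
nodetool enablebackup
# Newly flushed sstables then accumulate here, per node:
ls /var/lib/cassandra/data/my_ks/my_table-*/backups/
# Note these are per-node sstable files, not a logical change feed; for a
# daily Spark load, reading through the table filtered on a writetime-style
# column is a common alternative.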
Constant MemtableFlushWriter Messages Following upgrade from 2.2.5 to 2.2.8
In the process of upgrading our cluster, nodes that got upgraded are constantly emitting these messages. There is no impact, but I wanted to know what they mean and why they appear only after the upgrade. Any feedback will be appreciated.
17-04-10 20:18:11,580 Memtable.java:352 - Writing Memtable-compactions_in_progress@748675126(0.008KiB serialized bytes, 1 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:1] 2017-04-10 20:18:11,588 Memtable.java:352 - Writing Memtable-compactions_in_progress@1129449190(0.195KiB serialized bytes, 12 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:2] 2017-04-10 20:18:14,426 Memtable.java:352 - Writing Memtable-compactions_in_progress@931709037(0.008KiB serialized bytes, 1 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:1] 2017-04-10 20:18:44,950 Memtable.java:352 - Writing Memtable-compactions_in_progress@1057180976(0.008KiB serialized bytes, 1 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:2] 2017-04-10 20:18:44,963 Memtable.java:352 - Writing Memtable-compactions_in_progress@2110307908(0.195KiB serialized bytes, 12 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:1] 2017-04-10 20:18:45,546 Memtable.java:352 - Writing Memtable-compactions_in_progress@1803704247(0.008KiB serialized bytes, 1 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:2] 2017-04-10 20:19:16,196 Memtable.java:352 - Writing Memtable-compactions_in_progress@1692030234(0.008KiB serialized bytes, 1 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:1] 2017-04-10 20:19:16,240 Memtable.java:352 - Writing Memtable-compactions_in_progress@12532575(0.098KiB serialized bytes, 6 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:2] 2017-04-10 20:19:16,241 Memtable.java:352 - Writing Memtable-compactions_in_progress@337283565(0.098KiB serialized bytes, 6 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:1] 2017-04-10 20:19:52,322 Memtable.java:352 - Writing Memtable-compactions_in_progress@810846450(0.008KiB serialized bytes, 1 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:2] 2017-04-10 20:19:52,561 Memtable.java:352 - Writing Memtable-compactions_in_progress@2010893318(0.008KiB serialized bytes, 1 ops, 0%/0% of on/off-heap limit)
Thank you