Re: Assassinate fails
maybe read the ticket above etc. When you're ready, back up the data, prepare the DELETE command well, and observe how one node reacts to the fix first.

As you can see, I think it's the 'good' fix, but I'm not comfortable with this operation. And you should not be either :).
To share my feeling about this operation: I would say there is a 95% chance this does not hurt and a 90% chance it fixes the issue, but if something goes wrong, if we are in the 5% where it does not go well, there is a non-negligible probability that you will destroy your cluster in a very bad way. I guess what I am trying to say is: be careful, watch your step, make sure you remove the right line, and ensure it works on one node with no harm.
I shared my feeling and I would try this fix. But it's ultimately your responsibility and I won't be behind the machine when you fix it. None of us will.

Good luck ! :)

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com
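A minimal sketch of the "back up, then fix a single node" sequence described above. That the DELETE targets the stale system.peers row for 192.168.1.18 is an assumption based on the later discussion, and whether system.peers accepts DELETEs can vary by Cassandra version, so treat this as an outline rather than a tested procedure:

    #!/usr/bin/env bash
    set -euo pipefail
    STALE_IP="192.168.1.18"   # assumption: the old address of the replaced node

    # 1. Snapshot the system keyspace before touching anything.
    nodetool snapshot -t before-peers-fix system

    # 2. SELECT is safe: confirm exactly which row would go away.
    cqlsh -e "SELECT peer, host_id FROM system.peers WHERE peer = '${STALE_IP}';"

    # 3. Run the DELETE on ONE node only, then watch the logs and
    #    'nodetool describecluster' before repeating anywhere else.
    cqlsh -e "DELETE FROM system.peers WHERE peer = '${STALE_IP}';"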
RE: Assassinate fails
Alex,

According to this TLP article http://thelastpickle.com/blog/2018/09/18/assassinate.html :

Note that the LEFT status should stick around for 72 hours to ensure all nodes come to the consensus that the node has been removed. So please don't rush things if that's the case. Again, it's only cosmetic.

If a gossip state will not forget a node that was removed from the cluster more than a week ago:

- Login to each node within the Cassandra cluster.
- Download jmxterm on each node, if nodetool assassinate is not an option.
- Run nodetool assassinate, or the unsafeAssassinateEndpoint command, multiple times in quick succession.
  - I typically recommend running the command 3-5 times within 2 seconds.
  - I understand that sometimes the command takes time to return, so the "2 seconds" suggestion is less of a requirement than it is a mindset.
  - Also, sometimes 3-5 times isn't enough. In such cases, shoot for the moon and try 20 assassination attempts in quick succession.

What we are trying to do is create a flood of messages requesting that all nodes completely forget there used to be an entry within the gossip state for the given IP address. If each node can prune its own gossip state and broadcast that to the rest of the nodes, we should eliminate any race conditions where at least one node still remembers the given IP address.

As soon as all nodes come to agreement that they don't remember the deprecated node, the cosmetic issue will no longer be a concern in any system.log, nodetool describecluster output, or nodetool gossipinfo output.
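A shell sketch of the "multiple times in quick succession, on every node" idea. The node list and SSH access are assumptions; five attempts mirrors the 3-5 suggestion above:

    for node in 192.168.1.9 192.168.1.12 192.168.1.17 192.168.1.22; do
      ssh "$node" '
        # fire several attempts concurrently so they land within ~2 seconds
        for i in 1 2 3 4 5; do nodetool assassinate 192.168.1.18 & done
        wait
      ' &
    done
    wait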
RE: Assassinate fails
Alex,

Did you remove the option JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=address_of_dead_node" after the node started, and then restart the node again?

Are you sure there isn't a typo in the file?

Ken
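A quick way to check for that leftover option is to grep the env file. The /etc/cassandra path assumes a package install; tarball installs keep it under conf/cassandra-env.sh instead:

    # is the replace_address line still active?
    grep -n "replace_address" /etc/cassandra/cassandra-env.sh

    # if so, comment it out and bounce the node
    sudo sed -i 's/^\(JVM_OPTS=.*replace_address.*\)/#\1/' /etc/cassandra/cassandra-env.sh
    sudo service cassandra restart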
RE: Assassinate fails
I see; system_auth is a separate keyspace.
Re: Assassinate fails
No, it can't. As Alain (and I) have said, since the system keyspace is local strategy, it's not replicated, and thus can't be repaired.
RE: Assassinate fails
Right, could be similar issue, same type of fix though.
Re: Assassinate fails
System != system_auth.
Re: Assassinate fails
Well, I tried: rolling restart did not work its magic.

|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.1.9   26.32 GiB  256     42.8%             76223d4c-9d9f-417f-be27-cebb791cddcc  rack1
UN  192.168.1.12  31.35 GiB  256     38.9%             719601e2-54a6-440e-a379-c9cf2dc20564  rack1
UN  192.168.1.17  25.2 GiB   256     41.4%             fa238b21-1db1-47dc-bfb7-beedc6c9967a  rack1
DN  192.168.1.18  ?          256     39.8%             null                                  rack1
UN  192.168.1.22  27.7 GiB   256     37.2%             09d24557-4e98-44c3-8c9d-53c4c31066e1  rack1

Alex
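To see which nodes still carry the ghost entry after the restart, the per-node views can be compared. A sketch; nodetool -h assumes remote JMX is reachable, and the node list is taken from the status output above:

    for node in 192.168.1.9 192.168.1.12 192.168.1.17 192.168.1.22; do
      echo "== $node =="
      # a node that still remembers 192.168.1.18 lists it as UNREACHABLE
      nodetool -h "$node" describecluster | grep -A 2 UNREACHABLE
    done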
RE: Assassinate fails
From Mastering Cassandra:

Forcing read repairs at consistency ALL

This type of repair isn't really part of the Apache Cassandra repair paradigm at all. When it was discovered that a read repair will trigger 100% of the time when a query is run at ALL consistency, this method of repair started to gain popularity in the community. In some cases, this method of forcing data consistency provided better results than normal, scheduled repairs.

Let's assume, for a second, that an application team is having a hard time logging into a node in a new data center. You try to cqlsh out to these nodes, and notice that you are also experiencing intermittent failures, leading you to suspect that the system_auth tables might be missing a replica or two. On one node you do manage to connect successfully using cqlsh. One quick way to fix consistency on the system_auth tables is to set consistency to ALL, and run an unbound SELECT on every table, tickling each record:

    use system_auth ;
    consistency ALL;
    consistency level set to ALL.

    SELECT COUNT(*) FROM resource_role_permissons_index ;
    SELECT COUNT(*) FROM role_permissions ;
    SELECT COUNT(*) FROM role_members ;
    SELECT COUNT(*) FROM roles;

This problem is often seen when logging in with the default cassandra user. Within cqlsh, there is code that forces the default cassandra user to connect by querying system_auth at QUORUM consistency. This can be problematic in larger clusters, and is another reason why you should never use the default cassandra user.
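The same sequence can be scripted from one healthy node. A sketch: the host and credentials are placeholders, and passing cqlsh's CONSISTENCY meta-command through -e is assumed to work on your cqlsh version. The misspelling of resource_role_permissons_index is Cassandra's own table name, not a typo here:

    for t in resource_role_permissons_index role_permissions role_members roles; do
      cqlsh 192.168.1.9 -u some_admin -p '***' \
        -e "CONSISTENCY ALL; SELECT COUNT(*) FROM system_auth.${t};"
    done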
Re: Assassinate fails
Hi,

@ Alain and Kenneth: I use C* for a time series database (KairosDB); replication and consistency are set by KairosDB and I would rather not mingle with it.

@ Nick and Alain: I have tried to stop / start every node but not with this process. I will try.

@ Jeff: I removed (replaced) this node 13 days ago.

@ Alain: In system.peers I see both the dead node and its replacement with the same ID:

     peer         | host_id
    --------------+--------------------------------------
     192.168.1.18 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
     192.168.1.22 | 09d24557-4e98-44c3-8c9d-53c4c31066e1

Is it expected?

If I cannot fix this, I think I will add new nodes and remove, one by one, the nodes that show the dead node in nodetool status.

Thanks all for your help.

Alex
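Whether that duplicate is expected can be checked against every node's view, since each node keeps its own peers table (listing everyone but itself). A sketch, node list assumed from earlier in the thread:

    for node in 192.168.1.9 192.168.1.12 192.168.1.17 192.168.1.22; do
      echo "== peers as seen by $node =="
      # the replaced node (192.168.1.18) should not appear anywhere
      cqlsh "$node" -e "SELECT peer, host_id FROM system.peers;"
    done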
Re: Assassinate fails
Ken,

Alain is right about the system tables. What you're describing only works on non-local tables. Changing the CL doesn't help with keyspaces that use LocalStrategy. Here's the definition of the system keyspace:

    CREATE KEYSPACE system WITH replication = {'class': 'LocalStrategy'} AND durable_writes = true;

Jon
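A one-liner to see which keyspaces are local, and therefore out of reach of any repair or consistency-level trick. The system_schema table assumes Cassandra 3.x; older versions expose system.schema_keyspaces instead:

    cqlsh -e "SELECT keyspace_name, replication FROM system_schema.keyspaces;"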
RE: Assassinate fails
The trick below I got from the book Mastering Cassandra. You have to set the consistency to ALL for it to work. I thought you guys knew that one.
Re: Assassinate fails
Hi Alex,

About previous advices:

> You might have inconsistent data in your system tables. Try setting the consistency level to ALL, then do a read query of system tables to force repair.

System tables use the 'LocalStrategy', thus I don't think any repair would happen for the system.* tables, regardless of the consistency you use. It should not harm, but I really think it won't help.

> This will sound a little silly but, have you tried rolling the cluster?

The other way around, the rolling restart does not sound that silly to me. I would try it before touching any other 'deeper' systems. It has indeed sometimes proven to do some magic for me as well. It's hard to guess on this kind of ghost node issue without working on the machine (and sometimes even when accessing the machine I had some trouble =)). Also, a rolling restart is an operation that should be easy to perform and low risk (if everything is well configured).

Other idea to explore:

You can actually select from the 'system.peers' table to see if all (other) nodes are referenced for each node. There should not be any dead nodes in there. By the way, you will see that different nodes have slightly different data in system.peers and are not in sync, thus there is no way to 'repair' that really. 'Select' is safe. Deleting non-existing 'peers', if any, shouldn't hurt if the node is dead anyway, but make sure you are doing the right thing: you can easily break your cluster from there. I did not see an issue (a bug) of those for a while though. Normally you should not have to go that deep, touching system tables.

Then also, nodes removed should be immediately removed from peers but persist for some time (7 days maybe?) in the gossip information (normally as 'LEFT'). This should not create the issue in 'nodetool describecluster' though.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com
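Alain mentions the LEFT record ageing out of gossip on its own. A sketch for checking where that stands on each node; nodetool -h assumes remote JMX is reachable, and the grep context width is arbitrary:

    for node in 192.168.1.9 192.168.1.12 192.168.1.17 192.168.1.22; do
      echo "== $node =="
      # a clean removal shows STATUS:LEFT plus an expiry timestamp (epoch millis)
      nodetool -h "$node" gossipinfo | grep -A 8 "/192.168.1.18" || echo "already forgotten"
    done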
Re: Assassinate fails
How long ago did you remove this host from the cluster? -- Jeff Jirsa > On Apr 4, 2019, at 8:09 AM, Nick Hatfield wrote: > > This will sound a little silly but, have you tried rolling the cluster? > > $> nodetool flush; nodetool drain; service cassandra stop > $> ps aux | grep ‘cassandra’ > > # make sure the process actually dies. If not you may need to kill -9 . > Check first to see if nodetool can connect first, nodetool gossipinfo. If the > connection is live and listening on the port, then just try re-running > service cassandra stop again. Kill -9 as a last resort > > $> service cassandra start > $> nodetool netstats | grep ‘NORMAL’ # wait for this to return before moving > on to the next node. > > Restart them all using this method, then run nodetool status again and see if > it is listed. > > Once other thing, I recall you said something about having to terminate a > node and then replace it. Make sure that whichever node you did the –Dreplace > flag on, does not still have it set when you start cassandra on it again! > > From: Alex [mailto:m...@aca-o.com] > Sent: Thursday, April 04, 2019 4:58 AM > To: user@cassandra.apache.org > Subject: Re: Assassinate fails > > Hi Anthony, > > Thanks for your help. > > I tried to run multiple times in quick succession but it fails with : > > -- StackTrace -- > java.lang.RuntimeException: Endpoint still alive: /192.168.1.18 generation > changed while trying to assassinate it > at > org.apache.cassandra.gms.Gossiper.assassinateEndpoint(Gossiper.java:592) > > I can see that the generation number for this node increases by 1 every time > I call nodetool assassinate ; and the command itself waits for 30 seconds > before assassinating node. When ran multiple times in quick succession, the > command fails because the generation number has been changed by the previous > instance. > > > > In 'nodetool gossipinfo', the node is marked as "LEFT" on every node. > > However, in 'nodetool describecluster', this node is marked as "unreacheable" > on 3 nodes out of 5. > > > > Alex > > > > Le 04.04.2019 00:56, Anthony Grasso a écrit : > > Hi Alex, > > We wrote a blog post on this topic late last year: > http://thelastpickle.com/blog/2018/09/18/assassinate.html. > > In short, you will need to run the assassinate command on each node > simultaneously a number of times in quick succession. This will generate a > number of messages requesting all nodes completely forget there used to be an > entry within the gossip state for the given IP address. > > Regards, > Anthony > > On Thu, 4 Apr 2019 at 03:32, Alex wrote: > Same result it seems: > Welcome to JMX terminal. Type "help" for available commands. 
RE: Assassinate fails
This will sound a little silly but, have you tried rolling the cluster?

$> nodetool flush; nodetool drain; service cassandra stop
$> ps aux | grep 'cassandra'

# Make sure the process actually dies. If it does not, you may need to kill -9 it.
# First check whether nodetool can still connect (nodetool gossipinfo). If the
# connection is live and listening on the port, just try re-running
# "service cassandra stop" again. Use kill -9 as a last resort.

$> service cassandra start
$> nodetool netstats | grep 'NORMAL'   # wait for this to return before moving on to the next node

Restart them all using this method, then run nodetool status again and see if the dead node is still listed.

One other thing: I recall you said something about having to terminate a node and then replace it. Make sure that whichever node you started with the -Dcassandra.replace_address flag does not still have it set when you start Cassandra on it again!
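For reference, a minimal scripted sketch of that rolling restart, assuming SSH access to every node, a sysvinit-style "cassandra" service, and the host list below; adapt it to your environment before trusting it:

for host in 192.168.1.9 192.168.1.12 192.168.1.14 192.168.1.17 192.168.1.22; do
  ssh "$host" 'nodetool flush && nodetool drain && sudo service cassandra stop'
  sleep 5
  # abort if the JVM is still alive on that host; kill -9 by hand as a last resort
  ssh "$host" 'pgrep -f CassandraDaemon' && { echo "$host: cassandra still running"; exit 1; }
  ssh "$host" 'sudo service cassandra start'
  # wait until the node reports Mode: NORMAL before moving to the next one
  until ssh "$host" 'nodetool netstats 2>/dev/null | grep -q NORMAL'; do sleep 10; done
done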
RE: Assassinate fails
Hi Alex,

You might have inconsistent data in your system tables. Try setting the consistency level to ALL, then do a read query of the system tables to force a read repair.

Kenneth Brotman
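In cqlsh terms, the suggestion amounts to something like the sketch below. One caveat: a read at ALL can only trigger read repair on replicated keyspaces such as system_auth; the local system keyspace uses LocalStrategy, so it is not replicated and cannot be repaired this way.

cqlsh -e "CONSISTENCY ALL; SELECT role, member_of FROM system_auth.roles;"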
Re: Assassinate fails
Hi Anthony,

Thanks for your help.

I tried to run it multiple times in quick succession but it fails with:

-- StackTrace --
java.lang.RuntimeException: Endpoint still alive: /192.168.1.18 generation changed while trying to assassinate it
        at org.apache.cassandra.gms.Gossiper.assassinateEndpoint(Gossiper.java:592)

I can see that the generation number for this node increases by 1 every time I call nodetool assassinate, and the command itself waits for 30 seconds before assassinating the node. When run multiple times in quick succession, the command fails because the generation number has been changed by the previous instance.

In 'nodetool gossipinfo', the node is marked as "LEFT" on every node.

However, in 'nodetool describecluster', this node is marked as "unreachable" on 3 nodes out of 5.

Alex
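A sketch of one way around that generation race, assuming SSH access from one machine and the host list below: fire the assassinate attempts from all nodes at roughly the same time, so the invocations all see the same generation instead of each run invalidating the next:

for host in 192.168.1.9 192.168.1.12 192.168.1.14 192.168.1.17 192.168.1.22; do
  ssh "$host" 'for i in 1 2 3 4 5; do nodetool assassinate 192.168.1.18; done' &
done
wait   # let every background ssh finish before checking gossipinfo again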
Re: Assassinate fails
Hi Alex,

We wrote a blog post on this topic late last year: http://thelastpickle.com/blog/2018/09/18/assassinate.html.

In short, you will need to run the assassinate command on each node simultaneously a number of times in quick succession. This will generate a number of messages requesting all nodes completely forget there used to be an entry within the gossip state for the given IP address.

Regards,
Anthony
Re: Assassinate fails
Same result it seems:

Welcome to JMX terminal. Type "help" for available commands.
$>open localhost:7199
#Connection to localhost:7199 is opened
$>bean org.apache.cassandra.net:type=Gossiper
#bean is set to org.apache.cassandra.net:type=Gossiper
$>run unsafeAssassinateEndpoint 192.168.1.18
#calling operation unsafeAssassinateEndpoint of mbean org.apache.cassandra.net:type=Gossiper
#RuntimeMBeanException: java.lang.NullPointerException

There is not much more to see in the log files:

WARN  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:13,626 Gossiper.java:575 - Assassinating /192.168.1.18 via gossip
INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:13,627 Gossiper.java:585 - Sleeping for 30000ms to ensure /192.168.1.18 does not change
INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:43,628 Gossiper.java:1029 - InetAddress /192.168.1.18 is now DOWN
INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:43,631 StorageService.java:2324 - Removing tokens [..] for /192.168.1.18
RE: Assassinate fails
Run assassinate the old way. It works very well...

wget -q -O jmxterm.jar http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar

java -jar ./jmxterm.jar

$>open localhost:7199
$>bean org.apache.cassandra.net:type=Gossiper
$>run unsafeAssassinateEndpoint 192.168.1.18
$>quit

Happy deleting
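If you would rather script that than type it interactively, jmxterm can read commands from stdin; a sketch, assuming the flags of jmxterm 1.0-alpha-4 (-l sets the JMX location, -n suppresses the interactive prompt), so verify against your version:

echo "run -b org.apache.cassandra.net:type=Gossiper unsafeAssassinateEndpoint 192.168.1.18" \
  | java -jar ./jmxterm.jar -l localhost:7199 -n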
Assassinate fails
Hello,

Short story:
- I had to replace a dead node in my cluster
- 1 week after, the dead node is still seen as DN by 3 out of 5 nodes
- the dead node has a null host_id
- assassinate on the dead node fails with an error

How can I get rid of this dead node?

Long story:
I had a 3-node cluster (Cassandra 3.9); one node went dead. I built a new node from scratch and "replaced" the dead node using the information from this page: https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsReplaceNode.html. It looked like the replacement went OK.

I added two more nodes to strengthen the cluster.

A few days have passed and the dead node is still visible and marked as "down" on 3 of 5 nodes in nodetool status:

--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.1.9   16 GiB     256     35.0%             76223d4c-9d9f-417f-be27-cebb791cddcc  rack1
UN  192.168.1.12  16.09 GiB  256     34.0%             719601e2-54a6-440e-a379-c9cf2dc20564  rack1
UN  192.168.1.14  14.16 GiB  256     32.6%             d8017a03-7e4e-47b7-89b9-cd9ec472d74f  rack1
UN  192.168.1.17  15.4 GiB   256     34.1%             fa238b21-1db1-47dc-bfb7-beedc6c9967a  rack1
DN  192.168.1.18  24.3 GiB   256     33.7%             null                                  rack1
UN  192.168.1.22  19.06 GiB  256     30.7%             09d24557-4e98-44c3-8c9d-53c4c31066e1  rack1

Its host ID is null, so I cannot use nodetool removenode. Moreover, nodetool assassinate 192.168.1.18 fails with:

error: null
-- StackTrace --
java.lang.NullPointerException

And in system.log:

INFO  [RMI TCP Connection(16)-127.0.0.1] 2019-03-27 17:39:38,595 Gossiper.java:585 - Sleeping for 30000ms to ensure /192.168.1.18 does not change
INFO  [CompactionExecutor:547] 2019-03-27 17:39:38,669 AutoSavingCache.java:393 - Saved KeyCache (27316 items) in 163 ms
INFO  [IndexSummaryManager:1] 2019-03-27 17:40:03,620 IndexSummaryRedistribution.java:75 - Redistributing index summaries
INFO  [RMI TCP Connection(16)-127.0.0.1] 2019-03-27 17:40:08,597 Gossiper.java:1029 - InetAddress /192.168.1.18 is now DOWN
INFO  [RMI TCP Connection(16)-127.0.0.1] 2019-03-27 17:40:08,599 StorageService.java:2324 - Removing tokens [-1061369577393671924,...]
ERROR [GossipStage:1] 2019-03-27 17:40:08,600 CassandraDaemon.java:226 - Exception in thread Thread[GossipStage:1,5,main]
java.lang.NullPointerException: null

In system.peers, the dead node shows up and has the same host ID as the replacing node:

cqlsh> select peer, host_id from system.peers;

 peer         | host_id
--------------+--------------------------------------
 192.168.1.18 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
 192.168.1.22 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
 192.168.1.9  | 76223d4c-9d9f-417f-be27-cebb791cddcc
 192.168.1.14 | d8017a03-7e4e-47b7-89b9-cd9ec472d74f
 192.168.1.12 | 719601e2-54a6-440e-a379-c9cf2dc20564

The dead node and the replacing node have different tokens in system.peers.

I should add that I also tried decommission on a node that still has 192.168.1.18 in its peers; it is still marked as "leaving" 5 days later. Nothing in nodetool netstats or nodetool compactionstats.

Thank you for taking the time to read this. Hope you can help.

Alex
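For completeness, the duplicated system.peers row above is what the riskier manual fix suggested elsewhere in this thread targets: deleting the stale entry directly. A sketch only; system.peers is local to each node, so this must be run on every node that still remembers the dead peer, after backing up the data directory, and proven harmless on a single node first:

# on each node that still lists the dead peer; back up the data directory first
cqlsh -e "DELETE FROM system.peers WHERE peer = '192.168.1.18';"
# then restart that node so it reloads its peer and gossip state
sudo service cassandra restart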