Re: Assassinate fails

2019-08-29 Thread Alain RODRIGUEZ
 maybe read the ticket above etc. When you're ready, backup the data,
> prepare well the DELETE command and observe how 1 node reacts to the fix
> first.
>
> As you can see, I think it's the 'good' fix, but I'm not comfortable with
> this operation. And you should not be either :).
> I would say, arbitrarily, to share my feeling about this operation, that
> there is a 95% chance this does not hurt and a 90% chance it fixes the issue,
> but if something goes wrong, if we are in the 5% where it does not go
> well, there is a non-negligible probability that you will destroy your
> cluster in a very bad way. I guess I am trying to say: be careful, watch your
> step, make sure you remove the right line, and ensure it works on one node with
> no harm.
> I shared my feeling and I would try this fix. But it's ultimately
> your responsibility and I won't be behind the machine when you'll fix it.
> None of us will.
>
> Good luck ! :)
>
> C*heers,
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>
> On Thu, 4 Apr 2019 at 19:29, Kenneth Brotman
> wrote:
>
>> Alex,
>>
>> According to this TLP article
>> http://thelastpickle.com/blog/2018/09/18/assassinate.html :
>>
>> Note that the LEFT status should stick around for 72 hours to ensure all
>> nodes come to the consensus that the node has been removed. So please don't
>> rush things if that's the case. Again, it's only cosmetic.
>>
>> If a gossip state will not forget a node that was removed from the
>> cluster more than a week ago:
>>
>> Login to each node within the Cassandra cluster.
>> Download jmxterm on each node, if nodetool assassinate is not an
>> option.
>> Run nodetool assassinate, or the unsafeAssassinateEndpoint command,
>> multiple times in quick succession.
>> I typically recommend running the command 3-5 times within 2
>> seconds.
>> I understand that sometimes the command takes time to return, so
>> the "2 seconds" suggestion is less of a requirement than it is a mindset.
>> Also, sometimes 3-5 times isn't enough. In such cases, shoot for
>> the moon and try 20 assassination attempts in quick succession.
>>
>> What we are trying to do is to create a flood of messages requesting all
>> nodes completely forget there used to be an entry within the gossip state
>> for the given IP address. If each node can prune its own gossip state and
>> broadcast that to the rest of the nodes, we should eliminate any race
>> conditions that may exist where at least one node still remembers the given
>> IP address.
>>
>> As soon as all nodes come to agreement that they don't remember the
>> deprecated node, the cosmetic issue will no longer be a concern in any
>> system.logs, nodetool describecluster commands, nor nodetool gossipinfo
>> output.
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Kenneth Brotman [mailto:kenbrot...@yahoo.com.INVALID]
>> Sent: Thursday, April 04, 2019 10:40 AM
>> To: user@cassandra.apache.org
>> Subject: RE: Assassinate fails
>>
>> Alex,
>>
>> Did you remove the option JVM_OPTS="$JVM_OPTS
>> -Dcassandra.replace_address=address_of_dead_node after the node started and
>> then restart the node again?
>>
>> Are you sure there isn't a typo in the file?
>>
>> Ken
>>
>>
>> -Original Message-
>> From: Kenneth Brotman [mailto:kenbrot...@yahoo.com.INVALID]
>> Sent: Thursday, April 04, 2019 10:31 AM
>> To: user@cassandra.apache.org
>> Subject: RE: Assassinate fails
>>
>> I see; system_auth is a separate keyspace.
>>
>> -Original Message-
>> From: Jon Haddad [mailto:j...@jonhaddad.com]
>> Sent: Thursday, April 04, 2019 10:17 AM
>> To: user@cassandra.apache.org
>> Subject: Re: Assassinate fails
>>
>> No, it can't.  As Alain (and I) have said, since the system keyspace
>> is local strategy, it's not replicated, and thus can't be repaired.
>>
>> On Thu, Apr 4, 2019 at 9:54 AM Kenneth Brotman
>>  wrote:
>> >
>> > Right, could be similar issue, same type of fix though.
>> >
>> > -Original Message-
>> > From: Jon Haddad [mailto:j...@jonhaddad.com]
>> > Sent: Thursday, April 04, 2019 9:52 AM
>> > To: user@cassandra.apache.org
>> > Subject: Re: Assassinate fails
>> >
>> > System != system_auth.
>> >
>> > On Thu, Apr 4, 2019 at 

Re: Assassinate fails

2019-08-16 Thread Alex
y to say be careful, watch your step, make sure you 
> remove the good line, ensure it works on one node with no harm. 
> I shared my feeling and I would try this fix. But it's ultimately your 
> responsibility and I won't be behind the machine when you'll fix it. None of 
> us will. 
> 
> Good luck ! :) 
> 
> C*heers, 
> 
> --- 
> Alain Rodriguez - al...@thelastpickle.com 
> France / Spain 
> 
> The Last Pickle - Apache Cassandra Consulting 
> http://www.thelastpickle.com 
> 
> On Thu, 4 Apr 2019 at 19:29, Kenneth Brotman
> wrote:
> 
>> Alex,
>> 
>> According to this TLP article 
>> http://thelastpickle.com/blog/2018/09/18/assassinate.html :
>> 
>> Note that the LEFT status should stick around for 72 hours to ensure all 
>> nodes come to the consensus that the node has been removed. So please don't 
>> rush things if that's the case. Again, it's only cosmetic.
>> 
>> If a gossip state will not forget a node that was removed from the cluster 
>> more than a week ago:
>> 
>> Login to each node within the Cassandra cluster.
>> Download jmxterm on each node, if nodetool assassinate is not an option.
>> Run nodetool assassinate, or the unsafeAssassinateEndpoint command, multiple 
>> times in quick succession.
>> I typically recommend running the command 3-5 times within 2 seconds.
>> I understand that sometimes the command takes time to return, so the "2 
>> seconds" suggestion is less of a requirement than it is a mindset.
>> Also, sometimes 3-5 times isn't enough. In such cases, shoot for the moon 
>> and try 20 assassination attempts in quick succession.
>> 
>> What we are trying to do is to create a flood of messages requesting all 
>> nodes completely forget there used to be an entry within the gossip state 
>> for the given IP address. If each node can prune its own gossip state and 
>> broadcast that to the rest of the nodes, we should eliminate any race 
>> conditions that may exist where at least one node still remembers the given 
>> IP address.
>> 
>> As soon as all nodes come to agreement that they don't remember the 
>> deprecated node, the cosmetic issue will no longer be a concern in any 
>> system.logs, nodetool describecluster commands, nor nodetool gossipinfo 
>> output.
>> 
>> -Original Message-
>> From: Kenneth Brotman [mailto:kenbrot...@yahoo.com.INVALID] 
>> Sent: Thursday, April 04, 2019 10:40 AM
>> To: user@cassandra.apache.org
>> Subject: RE: Assassinate fails
>> 
>> Alex,
>> 
>> Did you remove the option JVM_OPTS="$JVM_OPTS 
>> -Dcassandra.replace_address=address_of_dead_node after the node started and 
>> then restart the node again?
>> 
>> Are you sure there isn't a typo in the file?
>> 
>> Ken
>> 
>> -Original Message-
>> From: Kenneth Brotman [mailto:kenbrot...@yahoo.com.INVALID] 
>> Sent: Thursday, April 04, 2019 10:31 AM
>> To: user@cassandra.apache.org
>> Subject: RE: Assassinate fails
>> 
>> I see; system_auth is a separate keyspace.
>> 
>> -Original Message-
>> From: Jon Haddad [mailto:j...@jonhaddad.com] 
>> Sent: Thursday, April 04, 2019 10:17 AM
>> To: user@cassandra.apache.org
>> Subject: Re: Assassinate fails
>> 
>> No, it can't.  As Alain (and I) have said, since the system keyspace
>> is local strategy, it's not replicated, and thus can't be repaired.
>> 
>> On Thu, Apr 4, 2019 at 9:54 AM Kenneth Brotman
>>  wrote:
>>> 
>>> Right, could be similar issue, same type of fix though.
>>> 
>>> -Original Message-
>>> From: Jon Haddad [mailto:j...@jonhaddad.com]
>>> Sent: Thursday, April 04, 2019 9:52 AM
>>> To: user@cassandra.apache.org
>>> Subject: Re: Assassinate fails
>>> 
>>> System != system_auth.
>>> 
>>> On Thu, Apr 4, 2019 at 9:43 AM Kenneth Brotman
>>>  wrote:
>>>> 
>>>> From Mastering Cassandra:
>>>> 
>>>> 
>>>> Forcing read repairs at consistency - ALL
>>>> 
>>>> The type of repair isn't really part of the Apache Cassandra repair 
>>>> paradigm at all. When it was discovered that a read repair will trigger 
>>>> 100% of the time when a query is run at ALL consistency, this method of 
>>>> repair started to gain popularity in the community. In some cases, this 
>>>> method of forcing data consistency provided better results than normal, 
>>>> scheduled repairs.

Re: Assassinate fails

2019-04-05 Thread Alain RODRIGUEZ
sn’t enough. In such cases, shoot for
> the moon and try 20 assassination attempts in quick succession.
>
> What we are trying to do is to create a flood of messages requesting all
> nodes completely forget there used to be an entry within the gossip state
> for the given IP address. If each node can prune its own gossip state and
> broadcast that to the rest of the nodes, we should eliminate any race
> conditions that may exist where at least one node still remembers the given
> IP address.
>
> As soon as all nodes come to agreement that they don’t remember the
> deprecated node, the cosmetic issue will no longer be a concern in any
> system.logs, nodetool describecluster commands, nor nodetool gossipinfo
> output.
>
>
>
>
>
> -Original Message-----
> From: Kenneth Brotman [mailto:kenbrot...@yahoo.com.INVALID]
> Sent: Thursday, April 04, 2019 10:40 AM
> To: user@cassandra.apache.org
> Subject: RE: Assassinate fails
>
> Alex,
>
> Did you remove the option JVM_OPTS="$JVM_OPTS
> -Dcassandra.replace_address=address_of_dead_node after the node started and
> then restart the node again?
>
> Are you sure there isn't a typo in the file?
>
> Ken
>
>
> -Original Message-
> From: Kenneth Brotman [mailto:kenbrot...@yahoo.com.INVALID]
> Sent: Thursday, April 04, 2019 10:31 AM
> To: user@cassandra.apache.org
> Subject: RE: Assassinate fails
>
> I see; system_auth is a separate keyspace.
>
> -Original Message-
> From: Jon Haddad [mailto:j...@jonhaddad.com]
> Sent: Thursday, April 04, 2019 10:17 AM
> To: user@cassandra.apache.org
> Subject: Re: Assassinate fails
>
> No, it can't.  As Alain (and I) have said, since the system keyspace
> is local strategy, it's not replicated, and thus can't be repaired.
>
> On Thu, Apr 4, 2019 at 9:54 AM Kenneth Brotman
>  wrote:
> >
> > Right, could be similar issue, same type of fix though.
> >
> > -Original Message-
> > From: Jon Haddad [mailto:j...@jonhaddad.com]
> > Sent: Thursday, April 04, 2019 9:52 AM
> > To: user@cassandra.apache.org
> > Subject: Re: Assassinate fails
> >
> > System != system_auth.
> >
> > On Thu, Apr 4, 2019 at 9:43 AM Kenneth Brotman
> >  wrote:
> > >
> > > From Mastering Cassandra:
> > >
> > >
> > > Forcing read repairs at consistency – ALL
> > >
> > > The type of repair isn't really part of the Apache Cassandra repair
> paradigm at all. When it was discovered that a read repair will trigger
> 100% of the time when a query is run at ALL consistency, this method of
> repair started to gain popularity in the community. In some cases, this
> method of forcing data consistency provided better results than normal,
> scheduled repairs.
> > >
> > > Let's assume, for a second, that an application team is having a hard
> time logging into a node in a new data center. You try to cqlsh out to
> these nodes, and notice that you are also experiencing intermittent
> failures, leading you to suspect that the system_auth tables might be
> missing a replica or two. On one node you do manage to connect successfully
> using cqlsh. One quick way to fix consistency on the system_auth tables is
> to set consistency to ALL, and run an unbound SELECT on every table,
> tickling each record:
> > >
> > > use system_auth ;
> > > consistency ALL;
> > > consistency level set to ALL.
> > >
> > > SELECT COUNT(*) FROM resource_role_permissons_index ;
> > > SELECT COUNT(*) FROM role_permissions ;
> > > SELECT COUNT(*) FROM role_members ;
> > > SELECT COUNT(*) FROM roles;
> > >
> > > This problem is often seen when logging in with the default cassandra
> user. Within cqlsh, there is code that forces the default cassandra user to
> connect by querying system_auth at QUORUM consistency. This can be
> problematic in larger clusters, and is another reason why you should never
> use the default cassandra user.
> > >
> > >
> > >
> > > -Original Message-
> > > From: Jon Haddad [mailto:j...@jonhaddad.com]
> > > Sent: Thursday, April 04, 2019 9:21 AM
> > > To: user@cassandra.apache.org
> > > Subject: Re: Assassinate fails
> > >
> > > Ken,
> > >
> > > Alain is right about the system tables.  What you're describing only
> > > works on non-local tables.  Changing the CL doesn't help with
> > > keyspaces that use LocalStrategy.  Here's the definition of the system
> > > keyspace:
> > >
> > > CREATE KEYSPACE system WITH replication = {'class': 'LocalStrategy'}

RE: Assassinate fails

2019-04-04 Thread Kenneth Brotman
Alex,

According to this TLP article 
http://thelastpickle.com/blog/2018/09/18/assassinate.html :

Note that the LEFT status should stick around for 72 hours to ensure all nodes 
come to the consensus that the node has been removed. So please don’t rush 
things if that’s the case. Again, it’s only cosmetic.

If a gossip state will not forget a node that was removed from the cluster more 
than a week ago:

Login to each node within the Cassandra cluster.
Download jmxterm on each node, if nodetool assassinate is not an option.
Run nodetool assassinate, or the unsafeAssassinateEndpoint command, 
multiple times in quick succession.
I typically recommend running the command 3-5 times within 2 seconds.
I understand that sometimes the command takes time to return, so the “2 
seconds” suggestion is less of a requirement than it is a mindset.
Also, sometimes 3-5 times isn’t enough. In such cases, shoot for the 
moon and try 20 assassination attempts in quick succession.

What we are trying to do is to create a flood of messages requesting all nodes 
completely forget there used to be an entry within the gossip state for the 
given IP address. If each node can prune its own gossip state and broadcast 
that to the rest of the nodes, we should eliminate any race conditions that may 
exist where at least one node still remembers the given IP address.

As soon as all nodes come to agreement that they don’t remember the deprecated 
node, the cosmetic issue will no longer be a concern in any system.logs, 
nodetool describecluster commands, nor nodetool gossipinfo output.
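For what it's worth, a minimal shell sketch of that "quick succession" idea, assuming the ghost endpoint is 192.168.1.18 (the address discussed in this thread) and that nodetool can reach the local node; treat it as an illustration rather than a recipe:

#!/usr/bin/env bash
# Sketch only: fire several assassinate attempts back to back so that every
# node prunes its gossip entry for the ghost endpoint and broadcasts that.
DEAD_NODE=192.168.1.18   # removed node's address, taken from this thread

for i in 1 2 3 4 5; do
  # Each call can block for ~30 seconds while gossip settles; backgrounding
  # keeps the attempts close together in time.
  nodetool assassinate "$DEAD_NODE" &
done
wait

# Afterwards, check whether gossip still knows about the endpoint.
nodetool gossipinfo | grep -A 2 "$DEAD_NODE" || echo "no gossip entry for $DEAD_NODE"

Run it on each node at roughly the same time; as noted above, 3-5 attempts are usually enough, but up to 20 may be needed.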





-Original Message-
From: Kenneth Brotman [mailto:kenbrot...@yahoo.com.INVALID] 
Sent: Thursday, April 04, 2019 10:40 AM
To: user@cassandra.apache.org
Subject: RE: Assassinate fails

Alex,

Did you remove the option JVM_OPTS="$JVM_OPTS
-Dcassandra.replace_address=address_of_dead_node" after the node started, and
then restart the node again?

Are you sure there isn't a typo in the file?

Ken


-Original Message-
From: Kenneth Brotman [mailto:kenbrot...@yahoo.com.INVALID] 
Sent: Thursday, April 04, 2019 10:31 AM
To: user@cassandra.apache.org
Subject: RE: Assassinate fails

I see; system_auth is a separate keyspace.

-Original Message-
From: Jon Haddad [mailto:j...@jonhaddad.com] 
Sent: Thursday, April 04, 2019 10:17 AM
To: user@cassandra.apache.org
Subject: Re: Assassinate fails

No, it can't.  As Alain (and I) have said, since the system keyspace
is local strategy, it's not replicated, and thus can't be repaired.

On Thu, Apr 4, 2019 at 9:54 AM Kenneth Brotman
 wrote:
>
> Right, could be similar issue, same type of fix though.
>
> -Original Message-
> From: Jon Haddad [mailto:j...@jonhaddad.com]
> Sent: Thursday, April 04, 2019 9:52 AM
> To: user@cassandra.apache.org
> Subject: Re: Assassinate fails
>
> System != system_auth.
>
> On Thu, Apr 4, 2019 at 9:43 AM Kenneth Brotman
>  wrote:
> >
> > From Mastering Cassandra:
> >
> >
> > Forcing read repairs at consistency – ALL
> >
> > The type of repair isn't really part of the Apache Cassandra repair 
> > paradigm at all. When it was discovered that a read repair will trigger 
> > 100% of the time when a query is run at ALL consistency, this method of 
> > repair started to gain popularity in the community. In some cases, this 
> > method of forcing data consistency provided better results than normal, 
> > scheduled repairs.
> >
> > Let's assume, for a second, that an application team is having a hard time 
> > logging into a node in a new data center. You try to cqlsh out to these 
> > nodes, and notice that you are also experiencing intermittent failures, 
> > leading you to suspect that the system_auth tables might be missing a 
> > replica or two. On one node you do manage to connect successfully using 
> > cqlsh. One quick way to fix consistency on the system_auth tables is to set 
> > consistency to ALL, and run an unbound SELECT on every table, tickling each 
> > record:
> >
> > use system_auth ;
> > consistency ALL;
> > consistency level set to ALL.
> >
> > SELECT COUNT(*) FROM resource_role_permissons_index ;
> > SELECT COUNT(*) FROM role_permissions ;
> > SELECT COUNT(*) FROM role_members ;
> > SELECT COUNT(*) FROM roles;
> >
> > This problem is often seen when logging in with the default cassandra user. 
> > Within cqlsh, there is code that forces the default cassandra user to 
> > connect by querying system_auth at QUORUM consistency. This can be 
> > problematic in larger clusters, and is another reason why you should never 
> > use the default cassandra user.
> >
> >
> >
> > -Original Message-
> > From: Jon Haddad 

RE: Assassinate fails

2019-04-04 Thread Kenneth Brotman
Alex,

Did you remove the option JVM_OPTS="$JVM_OPTS
-Dcassandra.replace_address=address_of_dead_node" after the node started, and
then restart the node again?

Are you sure there isn't a typo in the file?
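For reference, that flag usually lives in conf/cassandra-env.sh; a sketch of what to look for (the address below is just the dead node from this thread):

# conf/cassandra-env.sh -- the replacement flag should only be present for the
# first start of the replacement node, and removed before any later restart.
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=192.168.1.18"

# Once the node has finished bootstrapping, comment out or delete that line,
# then restart Cassandra on that node.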

Ken


-Original Message-
From: Kenneth Brotman [mailto:kenbrot...@yahoo.com.INVALID] 
Sent: Thursday, April 04, 2019 10:31 AM
To: user@cassandra.apache.org
Subject: RE: Assassinate fails

I see; system_auth is a separate keyspace.

-Original Message-
From: Jon Haddad [mailto:j...@jonhaddad.com] 
Sent: Thursday, April 04, 2019 10:17 AM
To: user@cassandra.apache.org
Subject: Re: Assassinate fails

No, it can't.  As Alain (and I) have said, since the system keyspace
is local strategy, it's not replicated, and thus can't be repaired.

On Thu, Apr 4, 2019 at 9:54 AM Kenneth Brotman
 wrote:
>
> Right, could be similar issue, same type of fix though.
>
> -Original Message-
> From: Jon Haddad [mailto:j...@jonhaddad.com]
> Sent: Thursday, April 04, 2019 9:52 AM
> To: user@cassandra.apache.org
> Subject: Re: Assassinate fails
>
> System != system_auth.
>
> On Thu, Apr 4, 2019 at 9:43 AM Kenneth Brotman
>  wrote:
> >
> > From Mastering Cassandra:
> >
> >
> > Forcing read repairs at consistency – ALL
> >
> > The type of repair isn't really part of the Apache Cassandra repair 
> > paradigm at all. When it was discovered that a read repair will trigger 
> > 100% of the time when a query is run at ALL consistency, this method of 
> > repair started to gain popularity in the community. In some cases, this 
> > method of forcing data consistency provided better results than normal, 
> > scheduled repairs.
> >
> > Let's assume, for a second, that an application team is having a hard time 
> > logging into a node in a new data center. You try to cqlsh out to these 
> > nodes, and notice that you are also experiencing intermittent failures, 
> > leading you to suspect that the system_auth tables might be missing a 
> > replica or two. On one node you do manage to connect successfully using 
> > cqlsh. One quick way to fix consistency on the system_auth tables is to set 
> > consistency to ALL, and run an unbound SELECT on every table, tickling each 
> > record:
> >
> > use system_auth ;
> > consistency ALL;
> > consistency level set to ALL.
> >
> > SELECT COUNT(*) FROM resource_role_permissons_index ;
> > SELECT COUNT(*) FROM role_permissions ;
> > SELECT COUNT(*) FROM role_members ;
> > SELECT COUNT(*) FROM roles;
> >
> > This problem is often seen when logging in with the default cassandra user. 
> > Within cqlsh, there is code that forces the default cassandra user to 
> > connect by querying system_auth at QUORUM consistency. This can be 
> > problematic in larger clusters, and is another reason why you should never 
> > use the default cassandra user.
> >
> >
> >
> > -Original Message-
> > From: Jon Haddad [mailto:j...@jonhaddad.com]
> > Sent: Thursday, April 04, 2019 9:21 AM
> > To: user@cassandra.apache.org
> > Subject: Re: Assassinate fails
> >
> > Ken,
> >
> > Alain is right about the system tables.  What you're describing only
> > works on non-local tables.  Changing the CL doesn't help with
> > keyspaces that use LocalStrategy.  Here's the definition of the system
> > keyspace:
> >
> > CREATE KEYSPACE system WITH replication = {'class': 'LocalStrategy'}
> > AND durable_writes = true;
> >
> > Jon
> >
> > On Thu, Apr 4, 2019 at 9:03 AM Kenneth Brotman
> >  wrote:
> > >
> > > The trick below I got from the book Mastering Cassandra.  You have to set 
> > > the consistency to ALL for it to work. I thought you guys knew that one.
> > >
> > >
> > >
> > > From: Alain RODRIGUEZ [mailto:arodr...@gmail.com]
> > > Sent: Thursday, April 04, 2019 8:46 AM
> > > To: user cassandra.apache.org
> > > Subject: Re: Assassinate fails
> > >
> > >
> > >
> > > Hi Alex,
> > >
> > >
> > >
> > > About previous advices:
> > >
> > >
> > >
> > > You might have inconsistent data in your system tables.  Try setting the 
> > > consistency level to ALL, then do read query of system tables to force 
> > > repair.
> > >
> > >
> > >
> > > System tables use the 'LocalStrategy', thus I don't think any repair 
> > > would happen for the system.* tables. Regardless the consistency you use. 
> > > It should not harm, but I really think it won't help.

RE: Assassinate fails

2019-04-04 Thread Kenneth Brotman
I see; system_auth is a separate keyspace.

-Original Message-
From: Jon Haddad [mailto:j...@jonhaddad.com] 
Sent: Thursday, April 04, 2019 10:17 AM
To: user@cassandra.apache.org
Subject: Re: Assassinate fails

No, it can't.  As Alain (and I) have said, since the system keyspace
is local strategy, it's not replicated, and thus can't be repaired.

On Thu, Apr 4, 2019 at 9:54 AM Kenneth Brotman
 wrote:
>
> Right, could be similar issue, same type of fix though.
>
> -Original Message-
> From: Jon Haddad [mailto:j...@jonhaddad.com]
> Sent: Thursday, April 04, 2019 9:52 AM
> To: user@cassandra.apache.org
> Subject: Re: Assassinate fails
>
> System != system_auth.
>
> On Thu, Apr 4, 2019 at 9:43 AM Kenneth Brotman
>  wrote:
> >
> > From Mastering Cassandra:
> >
> >
> > Forcing read repairs at consistency – ALL
> >
> > The type of repair isn't really part of the Apache Cassandra repair 
> > paradigm at all. When it was discovered that a read repair will trigger 
> > 100% of the time when a query is run at ALL consistency, this method of 
> > repair started to gain popularity in the community. In some cases, this 
> > method of forcing data consistency provided better results than normal, 
> > scheduled repairs.
> >
> > Let's assume, for a second, that an application team is having a hard time 
> > logging into a node in a new data center. You try to cqlsh out to these 
> > nodes, and notice that you are also experiencing intermittent failures, 
> > leading you to suspect that the system_auth tables might be missing a 
> > replica or two. On one node you do manage to connect successfully using 
> > cqlsh. One quick way to fix consistency on the system_auth tables is to set 
> > consistency to ALL, and run an unbound SELECT on every table, tickling each 
> > record:
> >
> > use system_auth ;
> > consistency ALL;
> > consistency level set to ALL.
> >
> > SELECT COUNT(*) FROM resource_role_permissons_index ;
> > SELECT COUNT(*) FROM role_permissions ;
> > SELECT COUNT(*) FROM role_members ;
> > SELECT COUNT(*) FROM roles;
> >
> > This problem is often seen when logging in with the default cassandra user. 
> > Within cqlsh, there is code that forces the default cassandra user to 
> > connect by querying system_auth at QUORUM consistency. This can be 
> > problematic in larger clusters, and is another reason why you should never 
> > use the default cassandra user.
> >
> >
> >
> > -Original Message-
> > From: Jon Haddad [mailto:j...@jonhaddad.com]
> > Sent: Thursday, April 04, 2019 9:21 AM
> > To: user@cassandra.apache.org
> > Subject: Re: Assassinate fails
> >
> > Ken,
> >
> > Alain is right about the system tables.  What you're describing only
> > works on non-local tables.  Changing the CL doesn't help with
> > keyspaces that use LocalStrategy.  Here's the definition of the system
> > keyspace:
> >
> > CREATE KEYSPACE system WITH replication = {'class': 'LocalStrategy'}
> > AND durable_writes = true;
> >
> > Jon
> >
> > On Thu, Apr 4, 2019 at 9:03 AM Kenneth Brotman
> >  wrote:
> > >
> > > The trick below I got from the book Mastering Cassandra.  You have to set 
> > > the consistency to ALL for it to work. I thought you guys knew that one.
> > >
> > >
> > >
> > > From: Alain RODRIGUEZ [mailto:arodr...@gmail.com]
> > > Sent: Thursday, April 04, 2019 8:46 AM
> > > To: user cassandra.apache.org
> > > Subject: Re: Assassinate fails
> > >
> > >
> > >
> > > Hi Alex,
> > >
> > >
> > >
> > > About previous advices:
> > >
> > >
> > >
> > > You might have inconsistent data in your system tables.  Try setting the 
> > > consistency level to ALL, then do read query of system tables to force 
> > > repair.
> > >
> > >
> > >
> > > System tables use the 'LocalStrategy', thus I don't think any repair 
> > > would happen for the system.* tables. Regardless the consistency you use. 
> > > It should not harm, but I really think it won't help.
> > >
> > >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> >
> >

Re: Assassinate fails

2019-04-04 Thread Jon Haddad
No, it can't.  As Alain (and I) have said, since the system keyspace
is local strategy, it's not replicated, and thus can't be repaired.

On Thu, Apr 4, 2019 at 9:54 AM Kenneth Brotman
 wrote:
>
> Right, could be similar issue, same type of fix though.
>
> -Original Message-
> From: Jon Haddad [mailto:j...@jonhaddad.com]
> Sent: Thursday, April 04, 2019 9:52 AM
> To: user@cassandra.apache.org
> Subject: Re: Assassinate fails
>
> System != system_auth.
>
> On Thu, Apr 4, 2019 at 9:43 AM Kenneth Brotman
>  wrote:
> >
> > From Mastering Cassandra:
> >
> >
> > Forcing read repairs at consistency – ALL
> >
> > The type of repair isn't really part of the Apache Cassandra repair 
> > paradigm at all. When it was discovered that a read repair will trigger 
> > 100% of the time when a query is run at ALL consistency, this method of 
> > repair started to gain popularity in the community. In some cases, this 
> > method of forcing data consistency provided better results than normal, 
> > scheduled repairs.
> >
> > Let's assume, for a second, that an application team is having a hard time 
> > logging into a node in a new data center. You try to cqlsh out to these 
> > nodes, and notice that you are also experiencing intermittent failures, 
> > leading you to suspect that the system_auth tables might be missing a 
> > replica or two. On one node you do manage to connect successfully using 
> > cqlsh. One quick way to fix consistency on the system_auth tables is to set 
> > consistency to ALL, and run an unbound SELECT on every table, tickling each 
> > record:
> >
> > use system_auth ;
> > consistency ALL;
> > consistency level set to ALL.
> >
> > SELECT COUNT(*) FROM resource_role_permissons_index ;
> > SELECT COUNT(*) FROM role_permissions ;
> > SELECT COUNT(*) FROM role_members ;
> > SELECT COUNT(*) FROM roles;
> >
> > This problem is often seen when logging in with the default cassandra user. 
> > Within cqlsh, there is code that forces the default cassandra user to 
> > connect by querying system_auth at QUORUM consistency. This can be 
> > problematic in larger clusters, and is another reason why you should never 
> > use the default cassandra user.
> >
> >
> >
> > -Original Message-
> > From: Jon Haddad [mailto:j...@jonhaddad.com]
> > Sent: Thursday, April 04, 2019 9:21 AM
> > To: user@cassandra.apache.org
> > Subject: Re: Assassinate fails
> >
> > Ken,
> >
> > Alain is right about the system tables.  What you're describing only
> > works on non-local tables.  Changing the CL doesn't help with
> > keyspaces that use LocalStrategy.  Here's the definition of the system
> > keyspace:
> >
> > CREATE KEYSPACE system WITH replication = {'class': 'LocalStrategy'}
> > AND durable_writes = true;
> >
> > Jon
> >
> > On Thu, Apr 4, 2019 at 9:03 AM Kenneth Brotman
> >  wrote:
> > >
> > > The trick below I got from the book Mastering Cassandra.  You have to set 
> > > the consistency to ALL for it to work. I thought you guys knew that one.
> > >
> > >
> > >
> > > From: Alain RODRIGUEZ [mailto:arodr...@gmail.com]
> > > Sent: Thursday, April 04, 2019 8:46 AM
> > > To: user cassandra.apache.org
> > > Subject: Re: Assassinate fails
> > >
> > >
> > >
> > > Hi Alex,
> > >
> > >
> > >
> > > About previous advices:
> > >
> > >
> > >
> > > You might have inconsistent data in your system tables.  Try setting the 
> > > consistency level to ALL, then do read query of system tables to force 
> > > repair.
> > >
> > >
> > >
> > > System tables use the 'LocalStrategy', thus I don't think any repair 
> > > would happen for the system.* tables. Regardless the consistency you use. 
> > > It should not harm, but I really think it won't help.
> > >
> > >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



RE: Assassinate fails

2019-04-04 Thread Kenneth Brotman
Right, could be similar issue, same type of fix though.

-Original Message-
From: Jon Haddad [mailto:j...@jonhaddad.com] 
Sent: Thursday, April 04, 2019 9:52 AM
To: user@cassandra.apache.org
Subject: Re: Assassinate fails

System != system_auth.

On Thu, Apr 4, 2019 at 9:43 AM Kenneth Brotman
 wrote:
>
> From Mastering Cassandra:
>
>
> Forcing read repairs at consistency – ALL
>
> The type of repair isn't really part of the Apache Cassandra repair paradigm 
> at all. When it was discovered that a read repair will trigger 100% of the 
> time when a query is run at ALL consistency, this method of repair started to 
> gain popularity in the community. In some cases, this method of forcing data 
> consistency provided better results than normal, scheduled repairs.
>
> Let's assume, for a second, that an application team is having a hard time 
> logging into a node in a new data center. You try to cqlsh out to these 
> nodes, and notice that you are also experiencing intermittent failures, 
> leading you to suspect that the system_auth tables might be missing a replica 
> or two. On one node you do manage to connect successfully using cqlsh. One 
> quick way to fix consistency on the system_auth tables is to set consistency 
> to ALL, and run an unbound SELECT on every table, tickling each record:
>
> use system_auth ;
> consistency ALL;
> consistency level set to ALL.
>
> SELECT COUNT(*) FROM resource_role_permissons_index ;
> SELECT COUNT(*) FROM role_permissions ;
> SELECT COUNT(*) FROM role_members ;
> SELECT COUNT(*) FROM roles;
>
> This problem is often seen when logging in with the default cassandra user. 
> Within cqlsh, there is code that forces the default cassandra user to connect 
> by querying system_auth at QUORUM consistency. This can be problematic in 
> larger clusters, and is another reason why you should never use the default 
> cassandra user.
>
>
>
> -Original Message-
> From: Jon Haddad [mailto:j...@jonhaddad.com]
> Sent: Thursday, April 04, 2019 9:21 AM
> To: user@cassandra.apache.org
> Subject: Re: Assassinate fails
>
> Ken,
>
> Alain is right about the system tables.  What you're describing only
> works on non-local tables.  Changing the CL doesn't help with
> keyspaces that use LocalStrategy.  Here's the definition of the system
> keyspace:
>
> CREATE KEYSPACE system WITH replication = {'class': 'LocalStrategy'}
> AND durable_writes = true;
>
> Jon
>
> On Thu, Apr 4, 2019 at 9:03 AM Kenneth Brotman
>  wrote:
> >
> > The trick below I got from the book Mastering Cassandra.  You have to set 
> > the consistency to ALL for it to work. I thought you guys knew that one.
> >
> >
> >
> > From: Alain RODRIGUEZ [mailto:arodr...@gmail.com]
> > Sent: Thursday, April 04, 2019 8:46 AM
> > To: user cassandra.apache.org
> > Subject: Re: Assassinate fails
> >
> >
> >
> > Hi Alex,
> >
> >
> >
> > About previous advices:
> >
> >
> >
> > You might have inconsistent data in your system tables.  Try setting the 
> > consistency level to ALL, then do read query of system tables to force 
> > repair.
> >
> >
> >
> > System tables use the 'LocalStrategy', thus I don't think any repair would 
> > happen for the system.* tables. Regardless the consistency you use. It 
> > should not harm, but I really think it won't help.
> >
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Assassinate fails

2019-04-04 Thread Jon Haddad
System != system_auth.

On Thu, Apr 4, 2019 at 9:43 AM Kenneth Brotman
 wrote:
>
> From Mastering Cassandra:
>
>
> Forcing read repairs at consistency – ALL
>
> The type of repair isn't really part of the Apache Cassandra repair paradigm 
> at all. When it was discovered that a read repair will trigger 100% of the 
> time when a query is run at ALL consistency, this method of repair started to 
> gain popularity in the community. In some cases, this method of forcing data 
> consistency provided better results than normal, scheduled repairs.
>
> Let's assume, for a second, that an application team is having a hard time 
> logging into a node in a new data center. You try to cqlsh out to these 
> nodes, and notice that you are also experiencing intermittent failures, 
> leading you to suspect that the system_auth tables might be missing a replica 
> or two. On one node you do manage to connect successfully using cqlsh. One 
> quick way to fix consistency on the system_auth tables is to set consistency 
> to ALL, and run an unbound SELECT on every table, tickling each record:
>
> use system_auth ;
> consistency ALL;
> consistency level set to ALL.
>
> SELECT COUNT(*) FROM resource_role_permissons_index ;
> SELECT COUNT(*) FROM role_permissions ;
> SELECT COUNT(*) FROM role_members ;
> SELECT COUNT(*) FROM roles;
>
> This problem is often seen when logging in with the default cassandra user. 
> Within cqlsh, there is code that forces the default cassandra user to connect 
> by querying system_auth at QUORUM consistency. This can be problematic in 
> larger clusters, and is another reason why you should never use the default 
> cassandra user.
>
>
>
> -Original Message-
> From: Jon Haddad [mailto:j...@jonhaddad.com]
> Sent: Thursday, April 04, 2019 9:21 AM
> To: user@cassandra.apache.org
> Subject: Re: Assassinate fails
>
> Ken,
>
> Alain is right about the system tables.  What you're describing only
> works on non-local tables.  Changing the CL doesn't help with
> keyspaces that use LocalStrategy.  Here's the definition of the system
> keyspace:
>
> CREATE KEYSPACE system WITH replication = {'class': 'LocalStrategy'}
> AND durable_writes = true;
>
> Jon
>
> On Thu, Apr 4, 2019 at 9:03 AM Kenneth Brotman
>  wrote:
> >
> > The trick below I got from the book Mastering Cassandra.  You have to set 
> > the consistency to ALL for it to work. I thought you guys knew that one.
> >
> >
> >
> > From: Alain RODRIGUEZ [mailto:arodr...@gmail.com]
> > Sent: Thursday, April 04, 2019 8:46 AM
> > To: user cassandra.apache.org
> > Subject: Re: Assassinate fails
> >
> >
> >
> > Hi Alex,
> >
> >
> >
> > About previous advices:
> >
> >
> >
> > You might have inconsistent data in your system tables.  Try setting the 
> > consistency level to ALL, then do read query of system tables to force 
> > repair.
> >
> >
> >
> > System tables use the 'LocalStrategy', thus I don't think any repair would 
> > happen for the system.* tables. Regardless the consistency you use. It 
> > should not harm, but I really think it won't help.
> >
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Assassinate fails

2019-04-04 Thread Alex
Well, I tried: the rolling restart did not work its magic. 

|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.1.9   26.32 GiB  256     42.8%             76223d4c-9d9f-417f-be27-cebb791cddcc  rack1
UN  192.168.1.12  31.35 GiB  256     38.9%             719601e2-54a6-440e-a379-c9cf2dc20564  rack1
UN  192.168.1.17  25.2 GiB   256     41.4%             fa238b21-1db1-47dc-bfb7-beedc6c9967a  rack1
DN  192.168.1.18  ?          256     39.8%             null                                  rack1
UN  192.168.1.22  27.7 GiB   256     37.2%             09d24557-4e98-44c3-8c9d-53c4c31066e1  rack1

Alex 

On 04.04.2019 18:26, Alex wrote:

> Hi, 
> 
> @ Alain and Kenneth : 
> 
> I use C* for a time series database (KairosDB) ; replication and consistency 
> are set by KairosDB and I would rather not mingle with it. 
> 
> @ Nick and Alain : 
> 
> I have tried to stop / start every node but not with this process. I will 
> try. 
> 
> @ Jeff : I removed (replaced) this node 13 days ago. 
> 
> @ Alain : In system.peers I see both the dead node and its replacement with 
> the same ID : 
> 
> peer | host_id
> --+--
> 192.168.1.18 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
> 192.168.1.22 | 09d24557-4e98-44c3-8c9d-53c4c31066e1 
> 
> Is it expected ? 
> 
> If I cannot fix this, I think I will add new nodes and remove, one by one, 
> the nodes that show the dead node in nodetool status. 
> 
> Thanks all for your help. 
> 
> Alex 
> 
> On 04.04.2019 17:45, Alain RODRIGUEZ wrote: 
> 
> Hi Alex, 
> 
> About previous advices: 
> 
> You might have inconsistent data in your system tables.  Try setting the 
> consistency level to ALL, then do read query of system tables to force 
> repair. 
> 
> System tables use the 'LocalStrategy', thus I don't think any repair would 
> happen for the system.* tables. Regardless the consistency you use. It should 
> not harm, but I really think it won't help. 
> This will sound a little silly but, have you tried rolling the cluster? 
> 
> The other way around, the rolling restart does not sound that silly to me. I 
> would try it before touching any other 'deeper' systems. It has indeed 
> sometimes proven to do some magic for me as well. It's hard to guess on this 
> kind of ghost node issues without being working on the machine (and sometimes 
> even when accessing the machine I had some trouble =)). Also a rolling 
> restart is an operation that should be easy to perform and with low risk (if 
> everything is well configured). 
> 
> Other idea to explore: 
> 
> You can actually select the 'system.peers' table to see if all (other) nodes 
> are referenced for each node. There should not be any dead nodes in there. By 
> the way you will see that different nodes have slightly different data in 
> system.peers, and are not in sync, thus no way to 'repair' that really. 
> 'Select' is safe. If you delete non-existing 'peers' if any, If the node is 
> dead anyway, this shouldn't hurt, but make sure you are doing the right thing 
> you can easily break your cluster from there. I did not see an issue (a bug) 
> of those for a while though. Normally you should not have to go that deep 
> touching system tables. 
> 
> Then also nodes removed should be immediately removed from peers but persist 
> for some time (7 days maybe?) in the gossip information (normally as 'LEFT'). 
> This should not create the issue in 'nodetool describecluster' though. 
> 
> C*heers, 
> 
> --- 
> Alain Rodriguez - al...@thelastpickle.com 
> France / Spain 
> 
> The Last Pickle - Apache Cassandra Consulting 
> http://www.thelastpickle.com 
> 
> On Thu, 4 Apr 2019 at 16:09, Nick Hatfield
> wrote:
> 
> This will sound a little silly but, have you tried rolling the cluster? 
> 
> $> nodetool flush; nodetool drain; service cassandra stop
> $> ps aux | grep 'cassandra'  
> 
> # make sure the process actually dies. If not you may need to kill -9 . 
> Check first to see if nodetool can connect first, nodetool gossipinfo. If the 
> connection is live and listening on the port, then just try re-running 
> service cassandra stop again. Kill -9 as a last resort
> 
> $> service cassandra start
> $> nodetool netstats | grep 'NORMAL'  # wait for this to return before moving 
> on to the next node. 
> 
> Restart them all using this method, then run nodetool status again and see if 
> it is listed. 
> 
> One other thing, I recall you said something about having to terminate a
> node and then replace it. Make sure that whichever node you did the -Dreplace
> flag on does not still have it set when you start cassandra on it again!

RE: Assassinate fails

2019-04-04 Thread Kenneth Brotman
From Mastering Cassandra:


Forcing read repairs at consistency – ALL

The type of repair isn't really part of the Apache Cassandra repair paradigm at 
all. When it was discovered that a read repair will trigger 100% of the time 
when a query is run at ALL consistency, this method of repair started to gain 
popularity in the community. In some cases, this method of forcing data 
consistency provided better results than normal, scheduled repairs.

Let's assume, for a second, that an application team is having a hard time 
logging into a node in a new data center. You try to cqlsh out to these nodes, 
and notice that you are also experiencing intermittent failures, leading you to 
suspect that the system_auth tables might be missing a replica or two. On one 
node you do manage to connect successfully using cqlsh. One quick way to fix 
consistency on the system_auth tables is to set consistency to ALL, and run an 
unbound SELECT on every table, tickling each record:

use system_auth ;
consistency ALL;
consistency level set to ALL.

SELECT COUNT(*) FROM resource_role_permissons_index ;
SELECT COUNT(*) FROM role_permissions ;
SELECT COUNT(*) FROM role_members ;
SELECT COUNT(*) FROM roles;

This problem is often seen when logging in with the default cassandra user. 
Within cqlsh, there is code that forces the default cassandra user to connect 
by querying system_auth at QUORUM consistency. This can be problematic in 
larger clusters, and is another reason why you should never use the default 
cassandra user.
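If cqlsh works on at least one node, the same "tickle" can be scripted from a shell. A rough sketch, assuming connection details and credentials are already handled (for example via cqlshrc):

# Unbound SELECTs at CL ALL touch every replica of the system_auth tables,
# letting the read-repair path fix any stale replica it finds.
cqlsh <<'EOF'
CONSISTENCY ALL;
USE system_auth;
SELECT COUNT(*) FROM resource_role_permissons_index;
SELECT COUNT(*) FROM role_permissions;
SELECT COUNT(*) FROM role_members;
SELECT COUNT(*) FROM roles;
EOF

As Jon and Alain point out elsewhere in this thread, this only helps for replicated keyspaces such as system_auth; it does nothing for the LocalStrategy system keyspace.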



-Original Message-
From: Jon Haddad [mailto:j...@jonhaddad.com] 
Sent: Thursday, April 04, 2019 9:21 AM
To: user@cassandra.apache.org
Subject: Re: Assassinate fails

Ken,

Alain is right about the system tables.  What you're describing only
works on non-local tables.  Changing the CL doesn't help with
keyspaces that use LocalStrategy.  Here's the definition of the system
keyspace:

CREATE KEYSPACE system WITH replication = {'class': 'LocalStrategy'}
AND durable_writes = true;

Jon

On Thu, Apr 4, 2019 at 9:03 AM Kenneth Brotman
 wrote:
>
> The trick below I got from the book Mastering Cassandra.  You have to set the 
> consistency to ALL for it to work. I thought you guys knew that one.
>
>
>
> From: Alain RODRIGUEZ [mailto:arodr...@gmail.com]
> Sent: Thursday, April 04, 2019 8:46 AM
> To: user cassandra.apache.org
> Subject: Re: Assassinate fails
>
>
>
> Hi Alex,
>
>
>
> About previous advices:
>
>
>
> You might have inconsistent data in your system tables.  Try setting the 
> consistency level to ALL, then do read query of system tables to force repair.
>
>
>
> System tables use the 'LocalStrategy', thus I don't think any repair would 
> happen for the system.* tables. Regardless the consistency you use. It should 
> not harm, but I really think it won't help.
>
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Assassinate fails

2019-04-04 Thread Alex
Hi, 

@ Alain and Kenneth : 

I use C* for a time series database (KairosDB) ; replication and
consistency are set by KairosDB and I would rather not mingle with it. 

@ Nick and Alain : 

I have tried to stop / start every node but not with this process. I
will try. 

@ Jeff : I removed (replaced) this node 13 days ago. 

@ Alain : In system.peers I see both the dead node and its replacement
with the same ID : 

   peer | host_id
  --+--
   192.168.1.18 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
   192.168.1.22 | 09d24557-4e98-44c3-8c9d-53c4c31066e1 

Is it expected ? 

If I cannot fix this, I think I will add new nodes and remove, one by
one, the nodes that show the dead node in nodetool status. 

Thanks all for your help. 

Alex 

On 04.04.2019 17:45, Alain RODRIGUEZ wrote:

> Hi Alex, 
> 
> About previous advices: 
> 
>> You might have inconsistent data in your system tables.  Try setting the 
>> consistency level to ALL, then do read query of system tables to force 
>> repair.
> 
> System tables use the 'LocalStrategy', thus I don't think any repair would 
> happen for the system.* tables. Regardless the consistency you use. It should 
> not harm, but I really think it won't help. 
> 
>> This will sound a little silly but, have you tried rolling the cluster?
> 
> The other way around, the rolling restart does not sound that silly to me. I 
> would try it before touching any other 'deeper' systems. It has indeed 
> sometimes proven to do some magic for me as well. It's hard to guess on this 
> kind of ghost node issues without being working on the machine (and sometimes 
> even when accessing the machine I had some trouble =)). Also a rolling 
> restart is an operation that should be easy to perform and with low risk (if 
> everything is well configured). 
> 
> Other idea to explore: 
> 
> You can actually select the 'system.peers' table to see if all (other) nodes 
> are referenced for each node. There should not be any dead nodes in there. By 
> the way you will see that different nodes have slightly different data in 
> system.peers, and are not in sync, thus no way to 'repair' that really. 
> 'Select' is safe. If you delete non-existing 'peers' if any, If the node is 
> dead anyway, this shouldn't hurt, but make sure you are doing the right thing 
> you can easily break your cluster from there. I did not see an issue (a bug) 
> of those for a while though. Normally you should not have to go that deep 
> touching system tables. 
> 
> Then also nodes removed should be immediately removed from peers but persist 
> for some time (7 days maybe?) in the gossip information (normally as 'LEFT'). 
> This should not create the issue in 'nodetool describecluster' though. 
> 
> C*heers, 
> 
> --- 
> Alain Rodriguez - al...@thelastpickle.com 
> France / Spain 
> 
> The Last Pickle - Apache Cassandra Consulting 
> http://www.thelastpickle.com 
> 
> On Thu, 4 Apr 2019 at 16:09, Nick Hatfield
> wrote:
> 
> This will sound a little silly but, have you tried rolling the cluster? 
> 
> $> nodetool flush; nodetool drain; service cassandra stop
> $> ps aux | grep 'cassandra'  
> 
> # make sure the process actually dies. If not you may need to kill -9 . 
> Check first to see if nodetool can connect first, nodetool gossipinfo. If the 
> connection is live and listening on the port, then just try re-running 
> service cassandra stop again. Kill -9 as a last resort
> 
> $> service cassandra start
> $> nodetool netstats | grep 'NORMAL'  # wait for this to return before moving 
> on to the next node. 
> 
> Restart them all using this method, then run nodetool status again and see if 
> it is listed. 
> 
> One other thing, I recall you said something about having to terminate a 
> node and then replace it. Make sure that whichever node you did the -Dreplace 
> flag on, does not still have it set when you start cassandra on it again! 
> 
> FROM: Alex [mailto:m...@aca-o.com] 
> SENT: Thursday, April 04, 2019 4:58 AM
> TO: user@cassandra.apache.org
> SUBJECT: Re: Assassinate fails 
> 
> Hi Anthony, 
> 
> Thanks for your help. 
> 
> I tried to run multiple times in quick succession but it fails with : 
> 
> -- StackTrace --
> java.lang.RuntimeException: Endpoint still alive: /192.168.1.18 
> generation changed while trying to assassinate it
> at org.apache.cassandra.gms.Gossiper.assassinateEndpoint(Gossiper.java:592) 
> 
> I can see that the generation number for this node increases by 1 every time 
> I call nodetool assassinate ; and the command itself waits for 30 seconds 
> before assassinating the node. When run multiple times in quick succession, the 
> command fails because the generation number has been changed by the previous instance.

Re: Assassinate fails

2019-04-04 Thread Jon Haddad
Ken,

Alain is right about the system tables.  What you're describing only
works on non-local tables.  Changing the CL doesn't help with
keyspaces that use LocalStrategy.  Here's the definition of the system
keyspace:

CREATE KEYSPACE system WITH replication = {'class': 'LocalStrategy'}
AND durable_writes = true;
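A quick, read-only way to see which keyspaces that distinction applies to (a sketch, assuming Cassandra 3.0+ where schema metadata lives in system_schema):

# LocalStrategy keyspaces (e.g. system) are per-node and cannot be repaired;
# system_auth should show a real replication strategy and factor.
cqlsh -e "SELECT keyspace_name, replication FROM system_schema.keyspaces;"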

Jon

On Thu, Apr 4, 2019 at 9:03 AM Kenneth Brotman
 wrote:
>
> The trick below I got from the book Mastering Cassandra.  You have to set the 
> consistency to ALL for it to work. I thought you guys knew that one.
>
>
>
> From: Alain RODRIGUEZ [mailto:arodr...@gmail.com]
> Sent: Thursday, April 04, 2019 8:46 AM
> To: user cassandra.apache.org
> Subject: Re: Assassinate fails
>
>
>
> Hi Alex,
>
>
>
> About previous advices:
>
>
>
> You might have inconsistent data in your system tables.  Try setting the 
> consistency level to ALL, then do read query of system tables to force repair.
>
>
>
> System tables use the 'LocalStrategy', thus I don't think any repair would 
> happen for the system.* tables. Regardless the consistency you use. It should 
> not harm, but I really think it won't help.
>
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



RE: Assassinate fails

2019-04-04 Thread Kenneth Brotman
The trick below I got from the book Mastering Cassandra.  You have to set the 
consistency to ALL for it to work. I thought you guys knew that one.

 

From: Alain RODRIGUEZ [mailto:arodr...@gmail.com] 
Sent: Thursday, April 04, 2019 8:46 AM
To: user cassandra.apache.org
Subject: Re: Assassinate fails

 

Hi Alex,

 

About previous advices:

 

You might have inconsistent data in your system tables.  Try setting the 
consistency level to ALL, then do read query of system tables to force repair.

 

System tables use the 'LocalStrategy', thus I don't think any repair would 
happen for the system.* tables. Regardless the consistency you use. It should 
not harm, but I really think it won't help.

 



Re: Assassinate fails

2019-04-04 Thread Alain RODRIGUEZ
Hi Alex,

About previous advices:

You might have inconsistent data in your system tables.  Try setting the
> consistency level to ALL, then do read query of system tables to force
> repair.
>

System tables use the 'LocalStrategy', thus I don't think any repair would
happen for the system.* tables, regardless of the consistency you use. It
should not harm, but I really think it won't help.

This will sound a little silly but, have you tried rolling the cluster?


The other way around, the rolling restart does not sound that silly to me.
I would try it before touching any other 'deeper' systems. It has indeed
sometimes proven to do some magic for me as well. It's hard to guess on
this kind of ghost node issues without being working on the machine (and
sometimes even when accessing the machine I had some trouble =)). Also a
rolling restart is an operation that should be easy to perform and with low
risk (if everything is well configured).

Other idea to explore:

You can actually select the 'system.peers' table to see if all (other)
nodes are referenced for each node. There should not be any dead nodes in
there. By the way you will see that different nodes have slightly different
data in system.peers, and are not in sync, thus no way to 'repair' that
really.
'Select' is safe. Deleting non-existent 'peers', if any, shouldn't hurt if the
node is dead anyway, but make sure you are doing the right thing: you can
easily break your cluster from there. I did not see an issue
(a bug) of those for a while though. Normally you should not have to go
that deep touching system tables.
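To make that concrete, a sketch of the read-only check, plus (commented out) what a targeted cleanup could look like, using the ghost address from this thread. The SELECT is harmless; the DELETE is exactly the kind of command to back up for, prepare carefully, and try on a single node first:

# Read-only: every row in system.peers should correspond to a live member.
cqlsh -e "SELECT peer, host_id FROM system.peers;"

# Only if a dead node (here 192.168.1.18) still shows up and you have a backup.
# system.peers is local to each node, so this has to be run on every node that
# still lists the ghost entry, one node at a time. Some versions restrict
# writes to system tables, in which case this will simply be rejected.
# cqlsh -e "DELETE FROM system.peers WHERE peer = '192.168.1.18';"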

Then also nodes removed should be immediately removed from peers but
persist for some time (7 days maybe?) in the gossip information (normally
as 'LEFT'). This should not create the issue in 'nodetool describecluster'
though.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


On Thu, 4 Apr 2019 at 16:09, Nick Hatfield
wrote:

> This will sound a little silly but, have you tried rolling the cluster?
>
>
>
> $> nodetool flush; nodetool drain; service cassandra stop
> $> ps aux | grep 'cassandra'
>
> # make sure the process actually dies. If not you may need to kill -9
> . Check first to see if nodetool can connect first, nodetool
> gossipinfo. If the connection is live and listening on the port, then just
> try re-running service cassandra stop again. Kill -9 as a last resort
>
> $> service cassandra start
> $> nodetool netstats | grep 'NORMAL'  # wait for this to return before
> moving on to the next node.
>
>
>
> Restart them all using this method, then run nodetool status again and see
> if it is listed.
>
>
>
> One other thing, I recall you said something about having to terminate a
> node and then replace it. Make sure that whichever node you did the
> -Dreplace flag on does not still have it set when you start cassandra on
> it again!
>
>
>
> *From:* Alex [mailto:m...@aca-o.com]
> *Sent:* Thursday, April 04, 2019 4:58 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Assassinate fails
>
>
>
> Hi Anthony,
>
> Thanks for your help.
>
> I tried to run multiple times in quick succession but it fails with :
>
> -- StackTrace --
> java.lang.RuntimeException: Endpoint still alive: /192.168.1.18
> generation changed while trying to assassinate it
> at
> org.apache.cassandra.gms.Gossiper.assassinateEndpoint(Gossiper.java:592)
>
> I can see that the generation number for this node increases by 1 every
> time I call nodetool assassinate ; and the command itself waits for 30
> seconds before assassinating node. When ran multiple times in quick
> succession, the command fails because the generation number has been
> changed by the previous instance.
>
>
>
> In 'nodetool gossipinfo', the node is marked as "LEFT" on every node.
>
> However, in 'nodetool describecluster', this node is marked as
> "unreacheable" on 3 nodes out of 5.
>
>
>
> Alex
>
>
>
> On 04.04.2019 00:56, Anthony Grasso wrote:
>
> Hi Alex,
>
>
>
> We wrote a blog post on this topic late last year:
> http://thelastpickle.com/blog/2018/09/18/assassinate.html.
>
>
>
> In short, you will need to run the assassinate command on each node
> simultaneously a number of times in quick succession. This will generate a
> number of messages requesting all nodes completely forget there used to be
> an entry within the gossip state for the given IP address.
>
>
>
> Regards,
>
> Anthony
>
>
>
> On Thu, 4 Apr 2019 at 03:32, Alex  wrote:
>
> Same result it seems:
> Welcome to JMX terminal. Type "help"

Re: Assassinate fails

2019-04-04 Thread Jeff Jirsa
How long ago did you remove this host from the cluster?



-- 
Jeff Jirsa


> On Apr 4, 2019, at 8:09 AM, Nick Hatfield  wrote:
> 
> This will sound a little silly but, have you tried rolling the cluster?
>  
> $> nodetool flush; nodetool drain; service cassandra stop
> $> ps aux | grep ‘cassandra’  
> 
> # make sure the process actually dies. If not, you may need to kill -9. 
> Check first whether nodetool can still connect (nodetool gossipinfo). If the 
> connection is live and listening on the port, then just try re-running 
> service cassandra stop again. Use kill -9 only as a last resort
> 
> $> service cassandra start
> $> nodetool netstats | grep ‘NORMAL’  # wait for this to return before moving 
> on to the next node.
>  
> Restart them all using this method, then run nodetool status again and see if 
> it is listed.
>  
> One other thing: I recall you said something about having to terminate a 
> node and then replace it. Make sure that whichever node you did the -Dreplace 
> flag on, does not still have it set when you start cassandra on it again!
>  
> From: Alex [mailto:m...@aca-o.com] 
> Sent: Thursday, April 04, 2019 4:58 AM
> To: user@cassandra.apache.org
> Subject: Re: Assassinate fails
>  
> Hi Anthony,
> 
> Thanks for your help.
> 
> I tried to run multiple times in quick succession but it fails with :
> 
> -- StackTrace --
> java.lang.RuntimeException: Endpoint still alive: /192.168.1.18 generation 
> changed while trying to assassinate it
> at 
> org.apache.cassandra.gms.Gossiper.assassinateEndpoint(Gossiper.java:592)
> 
> I can see that the generation number for this node increases by 1 every time 
> I call nodetool assassinate ; and the command itself waits for 30 seconds 
> before assassinating the node. When run multiple times in quick succession, the 
> command fails because the generation number has been changed by the previous 
> instance.
> 
>  
> 
> In 'nodetool gossipinfo', the node is marked as "LEFT" on every node.
> 
> However, in 'nodetool describecluster', this node is marked as "unreachable" 
> on 3 nodes out of 5.
> 
>  
> 
> Alex
> 
>  
> 
> On 04.04.2019 00:56, Anthony Grasso wrote:
> 
> Hi Alex,
>  
> We wrote a blog post on this topic late last year: 
> http://thelastpickle.com/blog/2018/09/18/assassinate.html.
>  
> In short, you will need to run the assassinate command on each node 
> simultaneously a number of times in quick succession. This will generate a 
> number of messages requesting all nodes completely forget there used to be an 
> entry within the gossip state for the given IP address.
>  
> Regards,
> Anthony
>  
> On Thu, 4 Apr 2019 at 03:32, Alex  wrote:
> Same result it seems:
> Welcome to JMX terminal. Type "help" for available commands.
> $>open localhost:7199
> #Connection to localhost:7199 is opened
> $>bean org.apache.cassandra.net:type=Gossiper
> #bean is set to org.apache.cassandra.net:type=Gossiper
> $>run unsafeAssassinateEndpoint 192.168.1.18
> #calling operation unsafeAssassinateEndpoint of mbean 
> org.apache.cassandra.net:type=Gossiper
> #RuntimeMBeanException: java.lang.NullPointerException
> 
> 
> There is not much more to see in the log files:
> WARN  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:13,626 
> Gossiper.java:575 - Assassinating /192.168.1.18 via gossip
> INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:13,627 
> Gossiper.java:585 - Sleeping for 30000ms to ensure /192.168.1.18 does 
> not change
> INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:43,628 
> Gossiper.java:1029 - InetAddress /192.168.1.18 is now DOWN
> INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:43,631 
> StorageService.java:2324 - Removing tokens [..] for /192.168.1.18
> 
> 
> 
> 
> On 03.04.2019 17:10, Nick Hatfield wrote:
> > Run assassinate the old way. It works very well...
> > 
> > wget -q -O jmxterm.jar
> > http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar
> > 
> > java -jar ./jmxterm.jar
> > 
> > $>open localhost:7199
> > 
> > $>bean org.apache.cassandra.net:type=Gossiper
> > 
> > $>run unsafeAssassinateEndpoint 192.168.1.18
> > 
> > $>quit
> > 
> > 
> > Happy deleting
> > 
> > -Original Message-
> > From: Alex [mailto:m...@aca-o.com]
> > Sent: Wednesday, April 03, 2019 10:42 AM
> > To: user@cassandra.apache.org
> > Subject: Assassinate fails
> > 
> > Hello,
> > 
> > Short story:
> > - I had to replace a dead node in my cluster
> > - 1 week afte

RE: Assassinate fails

2019-04-04 Thread Nick Hatfield
This will sound a little silly but, have you tried rolling the cluster?

$> nodetool flush; nodetool drain; service cassandra stop
$> ps aux | grep ‘cassandra’

# make sure the process actually dies. If not, you may need to kill -9. 
Check first whether nodetool can still connect (nodetool gossipinfo). If the 
connection is live and listening on the port, then just try re-running service 
cassandra stop again. Use kill -9 only as a last resort

$> service cassandra start
$> nodetool netstats | grep ‘NORMAL’  # wait for this to return before moving 
on to the next node.

Restart them all using this method, then run nodetool status again and see if 
it is listed.
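
Roughly, per node, something like this (a sketch only; service name and init
system are assumptions, adjust to your setup):

$> nodetool flush && nodetool drain && service cassandra stop
$> ps aux | grep 'cassandra'     # confirm the process is gone; kill -9 only as a last resort
$> service cassandra start
$> until nodetool netstats 2>/dev/null | grep -q 'NORMAL'; do sleep 10; done
$> nodetool status               # check the ring before moving to the next node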

One other thing: I recall you said something about having to terminate a node 
and then replace it. Make sure that whichever node you did the -Dreplace flag 
on, does not still have it set when you start cassandra on it again!

From: Alex [mailto:m...@aca-o.com]
Sent: Thursday, April 04, 2019 4:58 AM
To: user@cassandra.apache.org
Subject: Re: Assassinate fails


Hi Anthony,

Thanks for your help.

I tried to run multiple times in quick succession but it fails with :

-- StackTrace --
java.lang.RuntimeException: Endpoint still alive: /192.168.1.18 generation 
changed while trying to assassinate it
at 
org.apache.cassandra.gms.Gossiper.assassinateEndpoint(Gossiper.java:592)

I can see that the generation number for this node increases by 1 every time I 
call nodetool assassinate ; and the command itself waits for 30 seconds before 
assassinating the node. When run multiple times in quick succession, the command 
fails because the generation number has been changed by the previous instance.



In 'nodetool gossipinfo', the node is marked as "LEFT" on every node.

However, in 'nodetool describecluster', this node is marked as "unreachable" 
on 3 nodes out of 5.



Alex



On 04.04.2019 00:56, Anthony Grasso wrote:
Hi Alex,

We wrote a blog post on this topic late last year: 
http://thelastpickle.com/blog/2018/09/18/assassinate.html.

In short, you will need to run the assassinate command on each node 
simultaneously a number of times in quick succession. This will generate a 
number of messages requesting all nodes completely forget there used to be an 
entry within the gossip state for the given IP address.

Regards,
Anthony

On Thu, 4 Apr 2019 at 03:32, Alex  wrote:
Same result it seems:
Welcome to JMX terminal. Type "help" for available commands.
$>open localhost:7199
#Connection to localhost:7199 is opened
$>bean org.apache.cassandra.net:type=Gossiper
#bean is set to org.apache.cassandra.net:type=Gossiper
$>run unsafeAssassinateEndpoint 192.168.1.18
#calling operation unsafeAssassinateEndpoint of mbean
org.apache.cassandra.net:type=Gossiper
#RuntimeMBeanException: java.lang.NullPointerException


There is not much more to see in the log files:
WARN  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:13,626
Gossiper.java:575 - Assassinating /192.168.1.18 via gossip
INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:13,627
Gossiper.java:585 - Sleeping for 30000ms to ensure /192.168.1.18 does
not change
INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:43,628
Gossiper.java:1029 - InetAddress /192.168.1.18 is now DOWN
INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:43,631
StorageService.java:2324 - Removing tokens [..] for /192.168.1.18




On 03.04.2019 17:10, Nick Hatfield wrote:
> Run assassinate the old way. It works very well...
>
> wget -q -O jmxterm.jar
> http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar
>
> java -jar ./jmxterm.jar
>
> $>open localhost:7199
>
> $>bean org.apache.cassandra.net:type=Gossiper
>
> $>run unsafeAssassinateEndpoint 192.168.1.18
>
> $>quit
>
>
> Happy deleting
>
> -Original Message-
> From: Alex [mailto:m...@aca-o.com]
> Sent: Wednesday, April 03, 2019 10:42 AM
> To: user@cassandra.apache.org
> Subject: Assassinate fails
>
> Hello,
>
> Short story:
> - I had to replace a dead node in my cluster
> - 1 week after, dead node is still seen as DN by 3 out of 5 nodes
> - dead node has null host_id
> - assassinate on dead node fails with error
>
> How can I get rid of this dead node ?
>
>
> Long story:
> I had a 3 nodes cluster (Cassandra 3.9) ; one node went dead. I built
> a new node from scratch and "replaced" the dead node using the
> information from this page
> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsReplaceNode.html.
> It looked like the replacement went ok.
>
> I added two more nodes to strengthen the cluster.
>
> A few 

RE: Assassinate fails

2019-04-04 Thread Kenneth Brotman
Hi Alex,

 

You might have inconsistent data in your system tables. Try setting the 
consistency level to ALL, then do a read query of the system tables to force a 
read repair.
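
For example (a sketch only; note this can only help for replicated keyspaces
such as system_auth - the node-local 'system' tables like system.peers are not
replicated, so they cannot be repaired this way):

cqlsh> CONSISTENCY ALL;
cqlsh> SELECT role FROM system_auth.roles;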

 

Kenneth Brotman

 

From: Alex [mailto:m...@aca-o.com] 
Sent: Thursday, April 04, 2019 1:58 AM
To: user@cassandra.apache.org
Subject: Re: Assassinate fails

 

Hi Anthony,

Thanks for your help.

I tried to run multiple times in quick succession but it fails with :

-- StackTrace --
java.lang.RuntimeException: Endpoint still alive: /192.168.1.18 generation 
changed while trying to assassinate it
at 
org.apache.cassandra.gms.Gossiper.assassinateEndpoint(Gossiper.java:592)

I can see that the generation number for this node increases by 1 every time I 
call nodetool assassinate ; and the command itself waits for 30 seconds before 
assassinating the node. When run multiple times in quick succession, the command 
fails because the generation number has been changed by the previous instance.

 

In 'nodetool gossipinfo', the node is marked as "LEFT" on every node.

However, in 'nodetool describecluster', this node is marked as "unreachable" 
on 3 nodes out of 5.

 

Alex

 

On 04.04.2019 00:56, Anthony Grasso wrote:

Hi Alex, 

 

We wrote a blog post on this topic late last year: 
http://thelastpickle.com/blog/2018/09/18/assassinate.html.

 

In short, you will need to run the assassinate command on each node 
simultaneously a number of times in quick succession. This will generate a 
number of messages requesting all nodes completely forget there used to be an 
entry within the gossip state for the given IP address.

 

Regards,

Anthony

 

On Thu, 4 Apr 2019 at 03:32, Alex  wrote:

Same result it seems:
Welcome to JMX terminal. Type "help" for available commands.
$>open localhost:7199
#Connection to localhost:7199 is opened
$>bean org.apache.cassandra.net:type=Gossiper
#bean is set to org.apache.cassandra.net:type=Gossiper
$>run unsafeAssassinateEndpoint 192.168.1.18
#calling operation unsafeAssassinateEndpoint of mbean 
org.apache.cassandra.net:type=Gossiper
#RuntimeMBeanException: java.lang.NullPointerException


There is not much more to see in the log files:
WARN  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:13,626 
Gossiper.java:575 - Assassinating /192.168.1.18 via gossip
INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:13,627 
Gossiper.java:585 - Sleeping for 30000ms to ensure /192.168.1.18 does 
not change
INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:43,628 
Gossiper.java:1029 - InetAddress /192.168.1.18 is now DOWN
INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:43,631 
StorageService.java:2324 - Removing tokens [..] for /192.168.1.18




On 03.04.2019 17:10, Nick Hatfield wrote:
> Run assassinate the old way. It works very well...
> 
> wget -q -O jmxterm.jar
> http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar
> 
> java -jar ./jmxterm.jar
> 
> $>open localhost:7199
> 
> $>bean org.apache.cassandra.net:type=Gossiper
> 
> $>run unsafeAssassinateEndpoint 192.168.1.18
> 
> $>quit
> 
> 
> Happy deleting
> 
> -Original Message-
> From: Alex [mailto:m...@aca-o.com]
> Sent: Wednesday, April 03, 2019 10:42 AM
> To: user@cassandra.apache.org
> Subject: Assassinate fails
> 
> Hello,
> 
> Short story:
> - I had to replace a dead node in my cluster
> - 1 week after, dead node is still seen as DN by 3 out of 5 nodes
> - dead node has null host_id
> - assassinate on dead node fails with error
> 
> How can I get rid of this dead node ?
> 
> 
> Long story:
> I had a 3 nodes cluster (Cassandra 3.9) ; one node went dead. I built
> a new node from scratch and "replaced" the dead node using the
> information from this page
> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsReplaceNode.html.
> It looked like the replacement went ok.
> 
> I added two more nodes to strengthen the cluster.
> 
> A few days have passed and the dead node is still visible and marked
> as "down" on 3 of 5 nodes in nodetool status:
> 
> --  Address   Load   Tokens   Owns (effective)  Host ID
>   Rack
> UN  192.168.1.9   16 GiB 256  35.0%
> 76223d4c-9d9f-417f-be27-cebb791cddcc  rack1
> UN  192.168.1.12  16.09 GiB  256  34.0%
> 719601e2-54a6-440e-a379-c9cf2dc20564  rack1
> UN  192.168.1.14  14.16 GiB  256  32.6%
> d8017a03-7e4e-47b7-89b9-cd9ec472d74f  rack1
> UN  192.168.1.17  15.4 GiB   256  34.1%
> fa238b21-1db1-47dc-bfb7-beedc6c9967a  rack1
> DN  192.168.1.18  24.3 GiB   256  33.7% null
>   rack1
> UN  192.168.1.22  19.06 GiB  256  30.7%
> 09d24557-4e98-44c3-8c9d-53c4c31066e1  rack1
> 
> Its host ID is nul

Re: Assassinate fails

2019-04-04 Thread Alex
Hi Anthony, 

Thanks for your help. 

I tried to run multiple times in quick succession but it fails with : 

-- StackTrace --
java.lang.RuntimeException: Endpoint still alive: /192.168.1.18
generation changed while trying to assassinate it
at
org.apache.cassandra.gms.Gossiper.assassinateEndpoint(Gossiper.java:592)


I can see that the generation number for this node increases by 1 every
time I call nodetool assassinate ; and the command itself waits for 30
seconds before assassinating the node. When run multiple times in quick
succession, the command fails because the generation number has been
changed by the previous instance. 

In 'nodetool gossipinfo', the node is marked as "LEFT" on every node. 

However, in 'nodetool describecluster', this node is marked as
"unreacheable" on 3 nodes out of 5. 

Alex 

On 04.04.2019 00:56, Anthony Grasso wrote:

> Hi Alex, 
> 
> We wrote a blog post on this topic late last year: 
> http://thelastpickle.com/blog/2018/09/18/assassinate.html. 
> 
> In short, you will need to run the assassinate command on each node 
> simultaneously a number of times in quick succession. This will generate a 
> number of messages requesting all nodes completely forget there used to be an 
> entry within the gossip state for the given IP address. 
> 
> Regards, 
> Anthony 
> 
> On Thu, 4 Apr 2019 at 03:32, Alex  wrote: 
> 
>> Same result it seems:
>> Welcome to JMX terminal. Type "help" for available commands.
>> $>open localhost:7199
>> #Connection to localhost:7199 is opened
>> $>bean org.apache.cassandra.net:type=Gossiper
>> #bean is set to org.apache.cassandra.net:type=Gossiper
>> $>run unsafeAssassinateEndpoint 192.168.1.18
>> #calling operation unsafeAssassinateEndpoint of mbean 
>> org.apache.cassandra.net:type=Gossiper
>> #RuntimeMBeanException: java.lang.NullPointerException
>> 
>> There is not much more to see in the log files:
>> WARN  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:13,626 
>> Gossiper.java:575 - Assassinating /192.168.1.18 via gossip
>> INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:13,627 
>> Gossiper.java:585 - Sleeping for 30000ms to ensure /192.168.1.18 does 
>> not change
>> INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:43,628 
>> Gossiper.java:1029 - InetAddress /192.168.1.18 is now DOWN
>> INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:43,631 
>> StorageService.java:2324 - Removing tokens [..] for /192.168.1.18
>> 
>> On 03.04.2019 17:10, Nick Hatfield wrote:
>>> Run assassinate the old way. It works very well...
>>> 
>>> wget -q -O jmxterm.jar
>>> http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar
>>> 
>>> java -jar ./jmxterm.jar
>>> 
>>> $>open localhost:7199
>>> 
>>> $>bean org.apache.cassandra.net:type=Gossiper
>>> 
>>> $>run unsafeAssassinateEndpoint 192.168.1.18
>>> 
>>> $>quit
>>> 
>>> 
>>> Happy deleting
>>> 
>>> -Original Message-
>>> From: Alex [mailto:m...@aca-o.com]
>>> Sent: Wednesday, April 03, 2019 10:42 AM
>>> To: user@cassandra.apache.org
>>> Subject: Assassinate fails
>>> 
>>> Hello,
>>> 
>>> Short story:
>>> - I had to replace a dead node in my cluster
>>> - 1 week after, dead node is still seen as DN by 3 out of 5 nodes
>>> - dead node has null host_id
>>> - assassinate on dead node fails with error
>>> 
>>> How can I get rid of this dead node ?
>>> 
>>> 
>>> Long story:
>>> I had a 3 nodes cluster (Cassandra 3.9) ; one node went dead. I built
>>> a new node from scratch and "replaced" the dead node using the
>>> information from this page
>>> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsReplaceNode.html.
>>> It looked like the replacement went ok.
>>> 
>>> I added two more nodes to strengthen the cluster.
>>> 
>>> A few days have passed and the dead node is still visible and marked
>>> as "down" on 3 of 5 nodes in nodetool status:
>>> 
>>> --  Address   Load   Tokens   Owns (effective)  Host ID
>>> Rack
>>> UN  192.168.1.9   16 GiB 256  35.0%
>>> 76223d4c-9d9f-417f-be27-cebb791cddcc  rack1
>>> UN  192.168.1.12  16.09 GiB  256  34.0%
>>> 719601e2-54a6-440e-a379-c9cf2dc20564  rack1
>>> UN  192.168.1.14  14.16 GiB  256  32.6%
>>> d8017a03-7e4e-47b7-8

Re: Assassinate fails

2019-04-03 Thread Anthony Grasso
Hi Alex,

We wrote a blog post on this topic late last year:
http://thelastpickle.com/blog/2018/09/18/assassinate.html.

In short, you will need to run the assassinate command on each node
simultaneously a number of times in quick succession. This will generate a
number of messages requesting all nodes completely forget there used to be
an entry within the gossip state for the given IP address.
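
One way to do that from a single admin host (just a sketch; it assumes ssh
access to every node and uses the live node IPs from your nodetool status
output):

for h in 192.168.1.9 192.168.1.12 192.168.1.14 192.168.1.17 192.168.1.22; do
  ssh "$h" 'for i in 1 2 3 4 5; do nodetool assassinate 192.168.1.18; done' &
done
wait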

Regards,
Anthony

On Thu, 4 Apr 2019 at 03:32, Alex  wrote:

> Same result it seems:
> Welcome to JMX terminal. Type "help" for available commands.
> $>open localhost:7199
> #Connection to localhost:7199 is opened
> $>bean org.apache.cassandra.net:type=Gossiper
> #bean is set to org.apache.cassandra.net:type=Gossiper
> $>run unsafeAssassinateEndpoint 192.168.1.18
> #calling operation unsafeAssassinateEndpoint of mbean
> org.apache.cassandra.net:type=Gossiper
> #RuntimeMBeanException: java.lang.NullPointerException
>
>
> There is not much more to see in the log files:
> WARN  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:13,626
> Gossiper.java:575 - Assassinating /192.168.1.18 via gossip
> INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:13,627
> Gossiper.java:585 - Sleeping for 30000ms to ensure /192.168.1.18 does
> not change
> INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:43,628
> Gossiper.java:1029 - InetAddress /192.168.1.18 is now DOWN
> INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:43,631
> StorageService.java:2324 - Removing tokens [..] for /192.168.1.18
>
>
>
>
> > On 03.04.2019 17:10, Nick Hatfield wrote:
> > Run assassinate the old way. It works very well...
> >
> > wget -q -O jmxterm.jar
> >
> http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar
> >
> > java -jar ./jmxterm.jar
> >
> > $>open localhost:7199
> >
> > $>bean org.apache.cassandra.net:type=Gossiper
> >
> > $>run unsafeAssassinateEndpoint 192.168.1.18
> >
> > $>quit
> >
> >
> > Happy deleting
> >
> > -Original Message-
> > From: Alex [mailto:m...@aca-o.com]
> > Sent: Wednesday, April 03, 2019 10:42 AM
> > To: user@cassandra.apache.org
> > Subject: Assassinate fails
> >
> > Hello,
> >
> > Short story:
> > - I had to replace a dead node in my cluster
> > - 1 week after, dead node is still seen as DN by 3 out of 5 nodes
> > - dead node has null host_id
> > - assassinate on dead node fails with error
> >
> > How can I get rid of this dead node ?
> >
> >
> > Long story:
> > I had a 3 nodes cluster (Cassandra 3.9) ; one node went dead. I built
> > a new node from scratch and "replaced" the dead node using the
> > information from this page
> >
> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsReplaceNode.html
> .
> > It looked like the replacement went ok.
> >
> > I added two more nodes to strengthen the cluster.
> >
> > A few days have passed and the dead node is still visible and marked
> > as "down" on 3 of 5 nodes in nodetool status:
> >
> > --  Address   Load   Tokens   Owns (effective)  Host ID
> >   Rack
> > UN  192.168.1.9   16 GiB 256  35.0%
> > 76223d4c-9d9f-417f-be27-cebb791cddcc  rack1
> > UN  192.168.1.12  16.09 GiB  256  34.0%
> > 719601e2-54a6-440e-a379-c9cf2dc20564  rack1
> > UN  192.168.1.14  14.16 GiB  256  32.6%
> > d8017a03-7e4e-47b7-89b9-cd9ec472d74f  rack1
> > UN  192.168.1.17  15.4 GiB   256  34.1%
> > fa238b21-1db1-47dc-bfb7-beedc6c9967a  rack1
> > DN  192.168.1.18  24.3 GiB   256  33.7% null
> >   rack1
> > UN  192.168.1.22  19.06 GiB  256  30.7%
> > 09d24557-4e98-44c3-8c9d-53c4c31066e1  rack1
> >
> > Its host ID is null, so I cannot use nodetool removenode. Moreover
> > nodetool assassinate 192.168.1.18 fails with :
> >
> > error: null
> > -- StackTrace --
> > java.lang.NullPointerException
> >
> > And in system.log:
> >
> > INFO  [RMI TCP Connection(16)-127.0.0.1] 2019-03-27 17:39:38,595
> Gossiper.java:585 - Sleeping for 30000ms to ensure /192.168.1.18 does
> > not change INFO  [CompactionExecutor:547] 2019-03-27 17:39:38,669
> > AutoSavingCache.java:393 - Saved KeyCache (27316 items) in 163 ms INFO
> >  [IndexSummaryManager:1] 2019-03-27 17:40:03,620
> > IndexSummaryRedistribution.java:75 - Redistributing index summaries
> > INFO  [RMI TCP Connection(16)-127.0.0.1] 2019-03-27 17:40:08,597
> > Gossiper.java:1029 

Re: Assassinate fails

2019-04-03 Thread Alex

Same result it seems:
Welcome to JMX terminal. Type "help" for available commands.
$>open localhost:7199
#Connection to localhost:7199 is opened
$>bean org.apache.cassandra.net:type=Gossiper
#bean is set to org.apache.cassandra.net:type=Gossiper
$>run unsafeAssassinateEndpoint 192.168.1.18
#calling operation unsafeAssassinateEndpoint of mbean 
org.apache.cassandra.net:type=Gossiper

#RuntimeMBeanException: java.lang.NullPointerException


There is not much more to see in the log files:
WARN  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:13,626 
Gossiper.java:575 - Assassinating /192.168.1.18 via gossip
INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:13,627 
Gossiper.java:585 - Sleeping for 30000ms to ensure /192.168.1.18 does 
not change
INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:43,628 
Gossiper.java:1029 - InetAddress /192.168.1.18 is now DOWN
INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:43,631 
StorageService.java:2324 - Removing tokens [..] for /192.168.1.18





On 03.04.2019 17:10, Nick Hatfield wrote:

Run assassinate the old way. It works very well...

wget -q -O jmxterm.jar
http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar

java -jar ./jmxterm.jar

$>open localhost:7199

$>bean org.apache.cassandra.net:type=Gossiper

$>run unsafeAssassinateEndpoint 192.168.1.18

$>quit


Happy deleting

-Original Message-
From: Alex [mailto:m...@aca-o.com]
Sent: Wednesday, April 03, 2019 10:42 AM
To: user@cassandra.apache.org
Subject: Assassinate fails

Hello,

Short story:
- I had to replace a dead node in my cluster
- 1 week after, dead node is still seen as DN by 3 out of 5 nodes
- dead node has null host_id
- assassinate on dead node fails with error

How can I get rid of this dead node ?


Long story:
I had a 3 nodes cluster (Cassandra 3.9) ; one node went dead. I built
a new node from scratch and "replaced" the dead node using the
information from this page
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsReplaceNode.html.
It looked like the replacement went ok.

I added two more nodes to strengthen the cluster.

A few days have passed and the dead node is still visible and marked
as "down" on 3 of 5 nodes in nodetool status:

--  Address   Load   Tokens   Owns (effective)  Host ID
  Rack
UN  192.168.1.9   16 GiB 256  35.0%
76223d4c-9d9f-417f-be27-cebb791cddcc  rack1
UN  192.168.1.12  16.09 GiB  256  34.0%
719601e2-54a6-440e-a379-c9cf2dc20564  rack1
UN  192.168.1.14  14.16 GiB  256  32.6%
d8017a03-7e4e-47b7-89b9-cd9ec472d74f  rack1
UN  192.168.1.17  15.4 GiB   256  34.1%
fa238b21-1db1-47dc-bfb7-beedc6c9967a  rack1
DN  192.168.1.18  24.3 GiB   256  33.7% null
  rack1
UN  192.168.1.22  19.06 GiB  256  30.7%
09d24557-4e98-44c3-8c9d-53c4c31066e1  rack1

Its host ID is null, so I cannot use nodetool removenode. Moreover
nodetool assassinate 192.168.1.18 fails with :

error: null
-- StackTrace --
java.lang.NullPointerException

And in system.log:

INFO  [RMI TCP Connection(16)-127.0.0.1] 2019-03-27 17:39:38,595
Gossiper.java:585 - Sleeping for 30000ms to ensure /192.168.1.18 does
not change INFO  [CompactionExecutor:547] 2019-03-27 17:39:38,669
AutoSavingCache.java:393 - Saved KeyCache (27316 items) in 163 ms INFO
 [IndexSummaryManager:1] 2019-03-27 17:40:03,620
IndexSummaryRedistribution.java:75 - Redistributing index summaries
INFO  [RMI TCP Connection(16)-127.0.0.1] 2019-03-27 17:40:08,597
Gossiper.java:1029 - InetAddress /192.168.1.18 is now DOWN INFO  [RMI
TCP Connection(16)-127.0.0.1] 2019-03-27 17:40:08,599
StorageService.java:2324 - Removing tokens [-1061369577393671924,...]
ERROR [GossipStage:1] 2019-03-27 17:40:08,600 CassandraDaemon.java:226
- Exception in thread Thread[GossipStage:1,5,main]
java.lang.NullPointerException: null


In system.peers, the dead node shows up and has the same host_id as the 
replacing node:


cqlsh> select peer, host_id from system.peers;

  peer | host_id
--+--
  192.168.1.18 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
  192.168.1.22 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
   192.168.1.9 | 76223d4c-9d9f-417f-be27-cebb791cddcc
  192.168.1.14 | d8017a03-7e4e-47b7-89b9-cd9ec472d74f
  192.168.1.12 | 719601e2-54a6-440e-a379-c9cf2dc20564

Dead node and replacing node have different tokens in system.peers.

I should add that I also tried to decommission a node that still has
192.168.1.18 in its peers - it is still marked as "leaving" 5 days
later. Nothing in nodetool netstats or nodetool compactionstats.


Thank you for taking the time to read this. Hope you can help.

Alex

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org

RE: Assassinate fails

2019-04-03 Thread Nick Hatfield
Run assassinate the old way. It works very well...

wget -q -O jmxterm.jar 
http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar

java -jar ./jmxterm.jar

$>open localhost:7199

$>bean org.apache.cassandra.net:type=Gossiper

$>run unsafeAssassinateEndpoint 192.168.1.18

$>quit
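
# Or non-interactively, if you want to script it or fire it several times in a
# row (a sketch; double-check the flags for your jmxterm version with --help):

$> echo "run -b org.apache.cassandra.net:type=Gossiper unsafeAssassinateEndpoint 192.168.1.18" | java -jar ./jmxterm.jar -l localhost:7199 -n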


Happy deleting

-Original Message-
From: Alex [mailto:m...@aca-o.com] 
Sent: Wednesday, April 03, 2019 10:42 AM
To: user@cassandra.apache.org
Subject: Assassinate fails

Hello,

Short story:
- I had to replace a dead node in my cluster
- 1 week after, dead node is still seen as DN by 3 out of 5 nodes
- dead node has null host_id
- assassinate on dead node fails with error

How can I get rid of this dead node ?


Long story:
I had a 3 nodes cluster (Cassandra 3.9) ; one node went dead. I built a new 
node from scratch and "replaced" the dead node using the information from this 
page 
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsReplaceNode.html.
 
It looked like the replacement went ok.

I added two more nodes to strengthen the cluster.

A few days have passed and the dead node is still visible and marked as "down" 
on 3 of 5 nodes in nodetool status:

--  Address   Load   Tokens   Owns (effective)  Host ID  
  Rack
UN  192.168.1.9   16 GiB 256  35.0% 
76223d4c-9d9f-417f-be27-cebb791cddcc  rack1
UN  192.168.1.12  16.09 GiB  256  34.0% 
719601e2-54a6-440e-a379-c9cf2dc20564  rack1
UN  192.168.1.14  14.16 GiB  256  32.6% 
d8017a03-7e4e-47b7-89b9-cd9ec472d74f  rack1
UN  192.168.1.17  15.4 GiB   256  34.1% 
fa238b21-1db1-47dc-bfb7-beedc6c9967a  rack1
DN  192.168.1.18  24.3 GiB   256  33.7% null 
  rack1
UN  192.168.1.22  19.06 GiB  256  30.7% 
09d24557-4e98-44c3-8c9d-53c4c31066e1  rack1

Its host ID is null, so I cannot use nodetool removenode. Moreover nodetool 
assassinate 192.168.1.18 fails with :

error: null
-- StackTrace --
java.lang.NullPointerException

And in system.log:

INFO  [RMI TCP Connection(16)-127.0.0.1] 2019-03-27 17:39:38,595
Gossiper.java:585 - Sleeping for 30000ms to ensure /192.168.1.18 does not 
change INFO  [CompactionExecutor:547] 2019-03-27 17:39:38,669
AutoSavingCache.java:393 - Saved KeyCache (27316 items) in 163 ms INFO  
[IndexSummaryManager:1] 2019-03-27 17:40:03,620
IndexSummaryRedistribution.java:75 - Redistributing index summaries INFO  [RMI 
TCP Connection(16)-127.0.0.1] 2019-03-27 17:40:08,597
Gossiper.java:1029 - InetAddress /192.168.1.18 is now DOWN INFO  [RMI TCP 
Connection(16)-127.0.0.1] 2019-03-27 17:40:08,599
StorageService.java:2324 - Removing tokens [-1061369577393671924,...] ERROR 
[GossipStage:1] 2019-03-27 17:40:08,600 CassandraDaemon.java:226 - Exception in 
thread Thread[GossipStage:1,5,main]
java.lang.NullPointerException: null


In system.peers, the dead node shows up and has the same host_id as the replacing node:

cqlsh> select peer, host_id from system.peers;

  peer | host_id
--+--
  192.168.1.18 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
  192.168.1.22 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
   192.168.1.9 | 76223d4c-9d9f-417f-be27-cebb791cddcc
  192.168.1.14 | d8017a03-7e4e-47b7-89b9-cd9ec472d74f
  192.168.1.12 | 719601e2-54a6-440e-a379-c9cf2dc20564

Dead node and replacing node have different tokens in system.peers.

I should add that I also tried to decommission a node that still has
192.168.1.18 in its peers - it is still marked as "leaving" 5 days later. 
Nothing in nodetool netstats or nodetool compactionstats.


Thank you for taking the time to read this. Hope you can help.

Alex

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Assassinate fails

2019-04-03 Thread Alex

Hello,

Short story:
- I had to replace a dead node in my cluster
- 1 week after, dead node is still seen as DN by 3 out of 5 nodes
- dead node has null host_id
- assassinate on dead node fails with error

How can I get rid of this dead node ?


Long story:
I had a 3 nodes cluster (Cassandra 3.9) ; one node went dead. I built a 
new node from scratch and "replaced" the dead node using the information 
from this page 
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsReplaceNode.html. 
It looked like the replacement went ok.


I added two more nodes to strengthen the cluster.

A few days have passed and the dead node is still visible and marked as 
"down" on 3 of 5 nodes in nodetool status:


--  Address   Load   Tokens   Owns (effective)  Host ID  
 Rack
UN  192.168.1.9   16 GiB 256  35.0% 
76223d4c-9d9f-417f-be27-cebb791cddcc  rack1
UN  192.168.1.12  16.09 GiB  256  34.0% 
719601e2-54a6-440e-a379-c9cf2dc20564  rack1
UN  192.168.1.14  14.16 GiB  256  32.6% 
d8017a03-7e4e-47b7-89b9-cd9ec472d74f  rack1
UN  192.168.1.17  15.4 GiB   256  34.1% 
fa238b21-1db1-47dc-bfb7-beedc6c9967a  rack1
DN  192.168.1.18  24.3 GiB   256  33.7% null 
 rack1
UN  192.168.1.22  19.06 GiB  256  30.7% 
09d24557-4e98-44c3-8c9d-53c4c31066e1  rack1


Its host ID is null, so I cannot use nodetool removenode. Moreover 
nodetool assassinate 192.168.1.18 fails with :


error: null
-- StackTrace --
java.lang.NullPointerException

And in system.log:

INFO  [RMI TCP Connection(16)-127.0.0.1] 2019-03-27 17:39:38,595 
Gossiper.java:585 - Sleeping for 30000ms to ensure /192.168.1.18 does 
not change
INFO  [CompactionExecutor:547] 2019-03-27 17:39:38,669 
AutoSavingCache.java:393 - Saved KeyCache (27316 items) in 163 ms
INFO  [IndexSummaryManager:1] 2019-03-27 17:40:03,620 
IndexSummaryRedistribution.java:75 - Redistributing index summaries
INFO  [RMI TCP Connection(16)-127.0.0.1] 2019-03-27 17:40:08,597 
Gossiper.java:1029 - InetAddress /192.168.1.18 is now DOWN
INFO  [RMI TCP Connection(16)-127.0.0.1] 2019-03-27 17:40:08,599 
StorageService.java:2324 - Removing tokens [-1061369577393671924,...]
ERROR [GossipStage:1] 2019-03-27 17:40:08,600 CassandraDaemon.java:226 - 
Exception in thread Thread[GossipStage:1,5,main]

java.lang.NullPointerException: null


In system.peers, the dead node shows up and has the same host_id as the 
replacing node:


cqlsh> select peer, host_id from system.peers;

 peer | host_id
--+--
 192.168.1.18 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
 192.168.1.22 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
  192.168.1.9 | 76223d4c-9d9f-417f-be27-cebb791cddcc
 192.168.1.14 | d8017a03-7e4e-47b7-89b9-cd9ec472d74f
 192.168.1.12 | 719601e2-54a6-440e-a379-c9cf2dc20564

Dead node and replacing node have different tokens in system.peers.

I should add that I also tried to decommission a node that still has 
192.168.1.18 in its peers - it is still marked as "leaving" 5 days 
later. Nothing in nodetool netstats or nodetool compactionstats.



Thank you for taking the time to read this. Hope you can help.

Alex

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org