[jira] [Updated] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message

2015-04-15 Thread Brandon Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-8336:

Fix Version/s: 2.1.5

 Quarantine nodes after receiving the gossip shutdown message
 

 Key: CASSANDRA-8336
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 2.0.15, 2.1.5

 Attachments: 8336-v2.txt, 8336-v3.txt, 8336-v4.txt, 8336.txt, 
 8366-v5.txt


 In CASSANDRA-3936 we added a gossip shutdown announcement.  The problem here 
 is that this isn't sufficient; you can still get TOEs and have to wait on the 
 FD to figure things out.  This happens due to gossip propagation time and 
 variance; if node X shuts down and sends the message to Y, but Z has a 
 greater gossip version than Y for X and has not yet received the message, it 
 can initiate gossip with Y and thus mark X alive again.  I propose 
 quarantining to solve this, however I feel it should be a -D parameter you 
 have to specify, so as not to destroy current dev and test practices, since 
 this will mean a node that shuts down will not be able to restart until the 
 quarantine expires.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message

2015-04-13 Thread Brandon Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-8336:

Attachment: 8366-v5.txt

v5 creates a SILENT_SHUTDOWN_STATES list that stop() now checks, which is 
DEAD_STATES plus bootstrap and left.  Since LEFT is in there, we don't need 
stopSilently anymore, which I rather like.  Unfortunately the issue I 
previously mentioned about failed bootstraps is not the fault of this patch but 
a bug in 2.0 itself I happened to discover, which we can address in 
CASSANDRA-9180.

 Quarantine nodes after receiving the gossip shutdown message
 

 Key: CASSANDRA-8336
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 2.0.15

 Attachments: 8336-v2.txt, 8336-v3.txt, 8336-v4.txt, 8336.txt, 
 8366-v5.txt


 In CASSANDRA-3936 we added a gossip shutdown announcement.  The problem here 
 is that this isn't sufficient; you can still get TOEs and have to wait on the 
 FD to figure things out.  This happens due to gossip propagation time and 
 variance; if node X shuts down and sends the message to Y, but Z has a 
 greater gossip version than Y for X and has not yet received the message, it 
 can initiate gossip with Y and thus mark X alive again.  I propose 
 quarantining to solve this, however I feel it should be a -D parameter you 
 have to specify, so as not to destroy current dev and test practices, since 
 this will mean a node that shuts down will not be able to restart until the 
 quarantine expires.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message

2015-03-31 Thread Brandon Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-8336:

Attachment: 8336-v4.txt

After wrestling with exceptions for a bit, I came up with a simpler solution.  
Gossiper's stop() can examine the local state itself, and skip shutdown 
announcement if it doesn't exist.  We still need stopSilently (which I renamed 
in this patch from stopForLeaving) for cases like decom, where we aren't coming 
back and don't wait to mutate our state on shutdown.

 Quarantine nodes after receiving the gossip shutdown message
 

 Key: CASSANDRA-8336
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 2.0.14

 Attachments: 8336-v2.txt, 8336-v3.txt, 8336-v4.txt, 8336.txt


 In CASSANDRA-3936 we added a gossip shutdown announcement.  The problem here 
 is that this isn't sufficient; you can still get TOEs and have to wait on the 
 FD to figure things out.  This happens due to gossip propagation time and 
 variance; if node X shuts down and sends the message to Y, but Z has a 
 greater gossip version than Y for X and has not yet received the message, it 
 can initiate gossip with Y and thus mark X alive again.  I propose 
 quarantining to solve this, however I feel it should be a -D parameter you 
 have to specify, so as not to destroy current dev and test practices, since 
 this will mean a node that shuts down will not be able to restart until the 
 quarantine expires.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message

2015-02-03 Thread Brandon Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-8336:

Attachment: 8336-v3.txt

v3 addresses the previous issues.  It turns out for the first problem, the 
simplest thing to do is not make shutdown a dead state, and instead special 
case detection of it in handleMajorStateChange at the very end.

 Quarantine nodes after receiving the gossip shutdown message
 

 Key: CASSANDRA-8336
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 2.0.13

 Attachments: 8336-v2.txt, 8336-v3.txt, 8336.txt


 In CASSANDRA-3936 we added a gossip shutdown announcement.  The problem here 
 is that this isn't sufficient; you can still get TOEs and have to wait on the 
 FD to figure things out.  This happens due to gossip propagation time and 
 variance; if node X shuts down and sends the message to Y, but Z has a 
 greater gossip version than Y for X and has not yet received the message, it 
 can initiate gossip with Y and thus mark X alive again.  I propose 
 quarantining to solve this, however I feel it should be a -D parameter you 
 have to specify, so as not to destroy current dev and test practices, since 
 this will mean a node that shuts down will not be able to restart until the 
 quarantine expires.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message

2015-02-03 Thread Brandon Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-8336:

Attachment: 8336-v3.txt

v3 addresses the previous issues.  It turns out for the first problem, the 
simplest thing to do is not make shutdown a dead state, and instead special 
case detection of it in handleMajorStateChange at the very end.  SS treats 
shutdown nodes as normal in the onJoin event, and then immediately afterward 
receives the event that it is shutdown.

 Quarantine nodes after receiving the gossip shutdown message
 

 Key: CASSANDRA-8336
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 2.0.13

 Attachments: 8336-v2.txt, 8336-v3.txt, 8336.txt


 In CASSANDRA-3936 we added a gossip shutdown announcement.  The problem here 
 is that this isn't sufficient; you can still get TOEs and have to wait on the 
 FD to figure things out.  This happens due to gossip propagation time and 
 variance; if node X shuts down and sends the message to Y, but Z has a 
 greater gossip version than Y for X and has not yet received the message, it 
 can initiate gossip with Y and thus mark X alive again.  I propose 
 quarantining to solve this, however I feel it should be a -D parameter you 
 have to specify, so as not to destroy current dev and test practices, since 
 this will mean a node that shuts down will not be able to restart until the 
 quarantine expires.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message

2015-02-03 Thread Brandon Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-8336:

Attachment: (was: 8336-v3.txt)

 Quarantine nodes after receiving the gossip shutdown message
 

 Key: CASSANDRA-8336
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 2.0.13

 Attachments: 8336-v2.txt, 8336.txt


 In CASSANDRA-3936 we added a gossip shutdown announcement.  The problem here 
 is that this isn't sufficient; you can still get TOEs and have to wait on the 
 FD to figure things out.  This happens due to gossip propagation time and 
 variance; if node X shuts down and sends the message to Y, but Z has a 
 greater gossip version than Y for X and has not yet received the message, it 
 can initiate gossip with Y and thus mark X alive again.  I propose 
 quarantining to solve this, however I feel it should be a -D parameter you 
 have to specify, so as not to destroy current dev and test practices, since 
 this will mean a node that shuts down will not be able to restart until the 
 quarantine expires.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message

2015-01-22 Thread Brandon Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-8336:

Attachment: 8336-v2.txt

This patch helps, but the problem with this is approach is the node can still 
flap, given a disjoint enough (gossip state-wise) cluster.  There are a few 
ways we can solve this:

* quarantine after shutdown.  This has the consequence of not being able to 
restart a node until the quarantine expires.

* Sleep for ring_delay or some interval after setting the shutdown state before 
sending the rpc shutdown.  I'm not 100% sure this would prevent the flapping, 
and sleeping that long on shutdown sucks as equally as not being able to reboot 
until the quarantine expires.

* Offline Richard suggested to me a third way, which I'll discuss below.

The method suggests when node X receives a shutdown event from Y, it will 
update its local state for Y to version Integer.MAX_VALUE, and thus no updates 
for the same generation will be accepted since they will always have a lower 
version.  When Y restarts it will have a new generation and everything will 
work normally.  

There is one consequence to this method, and that is that gossipdisable/enable 
has to now generate a new generation, which triggers the has restarted, now 
UP message on other nodes, but this seems like a fairly minor thing.

On the surface, it may seem easier to have Y just send with a version of 
MAX_VALUE, but that will only apply to nodes that receive it via gossip, not 
the ones that receive it via rpc which is likely the bulk of them, and it 
wouldn't be an optimization anyway since we only sleep for one gossip round, 
and the node(s) we gossip to will set the version anyway before propagating it 
to the rest of the cluster.

v2 does this.

 Quarantine nodes after receiving the gossip shutdown message
 

 Key: CASSANDRA-8336
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 2.0.13

 Attachments: 8336-v2.txt, 8336.txt


 In CASSANDRA-3936 we added a gossip shutdown announcement.  The problem here 
 is that this isn't sufficient; you can still get TOEs and have to wait on the 
 FD to figure things out.  This happens due to gossip propagation time and 
 variance; if node X shuts down and sends the message to Y, but Z has a 
 greater gossip version than Y for X and has not yet received the message, it 
 can initiate gossip with Y and thus mark X alive again.  I propose 
 quarantining to solve this, however I feel it should be a -D parameter you 
 have to specify, so as not to destroy current dev and test practices, since 
 this will mean a node that shuts down will not be able to restart until the 
 quarantine expires.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message

2015-01-22 Thread Brandon Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-8336:

Attachment: (was: 8336-v2.txt)

 Quarantine nodes after receiving the gossip shutdown message
 

 Key: CASSANDRA-8336
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 2.0.13

 Attachments: 8336-v2.txt, 8336.txt


 In CASSANDRA-3936 we added a gossip shutdown announcement.  The problem here 
 is that this isn't sufficient; you can still get TOEs and have to wait on the 
 FD to figure things out.  This happens due to gossip propagation time and 
 variance; if node X shuts down and sends the message to Y, but Z has a 
 greater gossip version than Y for X and has not yet received the message, it 
 can initiate gossip with Y and thus mark X alive again.  I propose 
 quarantining to solve this, however I feel it should be a -D parameter you 
 have to specify, so as not to destroy current dev and test practices, since 
 this will mean a node that shuts down will not be able to restart until the 
 quarantine expires.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message

2015-01-22 Thread Brandon Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-8336:

Attachment: 8336-v2.txt

 Quarantine nodes after receiving the gossip shutdown message
 

 Key: CASSANDRA-8336
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 2.0.13

 Attachments: 8336-v2.txt, 8336.txt


 In CASSANDRA-3936 we added a gossip shutdown announcement.  The problem here 
 is that this isn't sufficient; you can still get TOEs and have to wait on the 
 FD to figure things out.  This happens due to gossip propagation time and 
 variance; if node X shuts down and sends the message to Y, but Z has a 
 greater gossip version than Y for X and has not yet received the message, it 
 can initiate gossip with Y and thus mark X alive again.  I propose 
 quarantining to solve this, however I feel it should be a -D parameter you 
 have to specify, so as not to destroy current dev and test practices, since 
 this will mean a node that shuts down will not be able to restart until the 
 quarantine expires.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message

2015-01-20 Thread T Jake Luciani (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

T Jake Luciani updated CASSANDRA-8336:
--
Fix Version/s: (was: 2.0.12)
   2.0.13

 Quarantine nodes after receiving the gossip shutdown message
 

 Key: CASSANDRA-8336
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 2.0.13


 In CASSANDRA-3936 we added a gossip shutdown announcement.  The problem here 
 is that this isn't sufficient; you can still get TOEs and have to wait on the 
 FD to figure things out.  This happens due to gossip propagation time and 
 variance; if node X shuts down and sends the message to Y, but Z has a 
 greater gossip version than Y for X and has not yet received the message, it 
 can initiate gossip with Y and thus mark X alive again.  I propose 
 quarantining to solve this, however I feel it should be a -D parameter you 
 have to specify, so as not to destroy current dev and test practices, since 
 this will mean a node that shuts down will not be able to restart until the 
 quarantine expires.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message

2015-01-20 Thread Brandon Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-8336:

Attachment: 8336.txt

Patch to take the hibernate approach, but use a new STATUS VV called SHUTDOWN.  
One reason we need this (not just for operator clarity) is that we have to 
special case it for Gossiper.convict, since if it receives our shutdown state 
passively over gossip before it receives it actively over RPC it will be in a 
dead state and left untouched. We can also race the other way (rpc before 
passive) but the isAlive check prevents us from doubly marking it down.

In a mixed cluster, this still preserves the old behavior, since they won't 
know what 'shutdown' means and just rely on the active rpc method to mark the 
node down.

 Quarantine nodes after receiving the gossip shutdown message
 

 Key: CASSANDRA-8336
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 2.0.13

 Attachments: 8336.txt


 In CASSANDRA-3936 we added a gossip shutdown announcement.  The problem here 
 is that this isn't sufficient; you can still get TOEs and have to wait on the 
 FD to figure things out.  This happens due to gossip propagation time and 
 variance; if node X shuts down and sends the message to Y, but Z has a 
 greater gossip version than Y for X and has not yet received the message, it 
 can initiate gossip with Y and thus mark X alive again.  I propose 
 quarantining to solve this, however I feel it should be a -D parameter you 
 have to specify, so as not to destroy current dev and test practices, since 
 this will mean a node that shuts down will not be able to restart until the 
 quarantine expires.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)