[jira] [Commented] (CASSANDRA-5915) node flapping prevents replace_node from succeeding consistently

2015-04-01 Thread Philip Thompson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390797#comment-14390797
 ] 

Philip Thompson commented on CASSANDRA-5915:


[~brandon.williams], do you think this could still be an issue?

 node flapping prevents replace_node from succeeding consistently
 

 Key: CASSANDRA-5915
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5915
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: 1.2.8
Reporter: Chris Burroughs
 Attachments: cassandra.log.gz


 A node was down for a week or two due to a hardware disk failure. I tried to 
 use replace_node to bring up a new node on the same physical host with the 
 same IPs. (rbranson suspected that using the same IP may be more issue 
 prone.) This failed with "unable to find sufficient sources for streaming 
 range". See CASSANDRA-5913 for a problem with how the failure was handled by 
 gossip.
 All of the other nodes should have been up the entire time, but when this 
 node came up it saw nodes flap up and down for quite some time.  I was 
 eventually able to get replace_token to work by adding a 60 (!) second sleep 
 to StorageService:bootstrap.  I don't know if the right path is "why are 
 things flapping so much" or "bootstrap should wait until things look stable".
 A few notes about the cluster:
  * 2 dc cluster (about 20 each), using GossipingPropertyFileSnitch
  * multi-dc no vpn setup: 
 http://mail-archives.apache.org/mod_mbox/cassandra-user/201306.mbox/%3c51bf5c79.7020...@gmail.com%3E
 Startup log from the successful (with sleep) replace_node attached.
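
For illustration only, the "bootstrap should wait until things look stable" option from the description could be sketched roughly as below. This is not Cassandra code: the class name, the method, and the idea of polling a supplier of live endpoints are all hypothetical, chosen just to show waiting for an unchanged view instead of sleeping for a fixed 60 seconds.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names exist in Cassandra.
final class StableViewSketch {
    /**
     * Polls liveView until it reports the same set of live endpoints on
     * requiredStablePolls consecutive polls, or gives up after maxPolls.
     * Returns true only if a stable view was observed.
     */
    static boolean waitForStableView(Supplier<Set<String>> liveView,
                                     int requiredStablePolls,
                                     int maxPolls,
                                     long pollIntervalMs) {
        Set<String> previous = null;
        int stable = 0;
        for (int i = 0; i < maxPolls; i++) {
            // Copy the view so later mutation by the caller cannot affect the comparison.
            Set<String> current = new HashSet<String>(liveView.get());
            // A changed view (a node flapped) restarts the count at 1.
            stable = current.equals(previous) ? stable + 1 : 1;
            if (stable >= requiredStablePolls) {
                return true;
            }
            previous = current;
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```

A caller would pass whatever reports the currently live nodes as liveView. A view that keeps flapping never reaches the required count, so the call returns false once maxPolls is exhausted rather than proceeding to stream.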



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-5915) node flapping prevents replace_node from succeeding consistently

2013-08-22 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747852#comment-13747852
 ] 

Brandon Williams commented on CASSANDRA-5915:
-

Ok, the same UUID does make sense, as long as the IP was not the same as the 
one being replaced, so I at least understand that part now. The empty STATUS, 
and why more ring delay helped, are still a mystery.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (CASSANDRA-5915) node flapping prevents replace_node from succeeding consistently

2013-08-21 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746604#comment-13746604
 ] 

Brandon Williams commented on CASSANDRA-5915:
-

I'm confused by how this worked at all the second time, because I encountered 
CASSANDRA-5916 in all attempts.  But I'm also confused how you can be missing 
the STATUS state, since a failed replace won't cause this; it will just leave 
the node with STATUS:hibernate,true. That said, I think the right path is "why 
are things flapping so much", since bootstrap already waits for RING_DELAY for 
things to stabilize, which at 30s should be plenty of time on any competent 
network. (And it's worth noting you can override ring delay next time if you 
need to wait longer.)
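
As a rough sketch of what overriding the ring delay amounts to, the snippet below reads a millisecond value from a JVM system property and falls back to the 30s default mentioned above. The property name cassandra.ring_delay_ms is an assumption here, not copied from the 1.2.8 source, and the class and method names are hypothetical.

```java
// Sketch only: the property name is an assumption, not taken from the
// Cassandra source, and this class does not exist in Cassandra.
final class RingDelaySketch {
    static final String PROPERTY = "cassandra.ring_delay_ms"; // assumed name
    static final int DEFAULT_MS = 30 * 1000; // the 30s default mentioned in the comment

    /** Returns the override if it parses as a non-negative int, else the default. */
    static int ringDelayMs(String override) {
        if (override == null || override.trim().isEmpty()) {
            return DEFAULT_MS;
        }
        try {
            int ms = Integer.parseInt(override.trim());
            return ms >= 0 ? ms : DEFAULT_MS;
        } catch (NumberFormatException e) {
            // An unparseable override falls back to the default instead of failing startup.
            return DEFAULT_MS;
        }
    }

    public static void main(String[] args) {
        // Start the JVM with -Dcassandra.ring_delay_ms=60000 to simulate a 60s override.
        System.out.println(ringDelayMs(System.getProperty(PROPERTY)));
    }
}
```

Under that assumption, starting the replacement node with the property set to 60000 would give the same 60 second wait as the hand-added sleep, without patching StorageService.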



[jira] [Commented] (CASSANDRA-5915) node flapping prevents replace_node from succeeding consistently

2013-08-21 Thread Chris Burroughs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746624#comment-13746624
 ] 

Chris Burroughs commented on CASSANDRA-5915:


FWIW it wasn't the second time.  Looks like I tried about 8 times in total, 
including "I hope it works this time" and the hacky increasing timeouts. Based 
on bash history I think the same UUID was used each time.
