[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node with the same IP address

2013-10-17 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798300#comment-13798300
 ] 

Tyler Hobbs commented on CASSANDRA-5916:


That strategy sounds good to me in principle.

I'm seeing a few problems when testing, though.

If I start node4 with replace_address=node3 (while node3 is either up or down), 
I get an NPE:

{noformat}
DEBUG 14:01:33,359 Node /127.0.0.4 state normal, token [6564349027099416762]
 INFO 14:01:33,362 Node /127.0.0.4 state jump to normal
ERROR 14:01:33,363 Exception encountered during startup
java.lang.NullPointerException
at org.apache.cassandra.gms.Gossiper.usesHostId(Gossiper.java:682)
at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:694)
at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:1382)
at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1250)
at org.apache.cassandra.gms.Gossiper.doNotifications(Gossiper.java:973)
at org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1187)
at org.apache.cassandra.service.StorageService.setTokens(StorageService.java:214)
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:824)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:584)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:481)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
java.lang.NullPointerException
at org.apache.cassandra.gms.Gossiper.usesHostId(Gossiper.java:682)
at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:694)
at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:1382)
at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1250)
at org.apache.cassandra.gms.Gossiper.doNotifications(Gossiper.java:973)
at org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1187)
at org.apache.cassandra.service.StorageService.setTokens(StorageService.java:214)
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:824)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:584)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:481)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
Exception encountered during startup: null
ERROR 14:01:33,368 Exception in thread Thread[StorageServiceShutdownHook,5,main]
java.lang.NullPointerException
at org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
at org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:370)
at org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:549)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:724)
{noformat}

If I do replace_address with a non-existent node, after the ring delay sleep, 
I'll see:
{noformat}
java.lang.RuntimeException: Unable to gossip with any seeds
{noformat}
which is misleading, as that's not the actual problem.  Perhaps we should 
explicitly check for presence of the address to replace?
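
A minimal sketch of what such a check might look like, assuming the set of endpoints learned from gossip is available before the replace proceeds. The class `ReplaceAddressCheck`, its methods, and the `Set<InetAddress>` input are hypothetical names for illustration, not Cassandra's actual API; the error wording follows the patch quoted later in this thread.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not Cassandra's actual API.
final class ReplaceAddressCheck
{
    // Parses an IP literal, converting the checked exception for brevity.
    static InetAddress addr(String ipLiteral)
    {
        try
        {
            return InetAddress.getByName(ipLiteral);
        }
        catch (UnknownHostException e)
        {
            throw new IllegalArgumentException(e);
        }
    }

    // Fails fast, naming the real problem, when the address to replace
    // is not among the endpoints learned from gossip.
    static void requirePresent(InetAddress replaceAddress, Set<InetAddress> knownEndpoints)
    {
        if (!knownEndpoints.contains(replaceAddress))
            throw new RuntimeException("Cannot replace_address " + replaceAddress
                                       + " because it doesn't exist in gossip");
    }

    public static void main(String[] args)
    {
        Set<InetAddress> known = new HashSet<>();
        known.add(addr("127.0.0.1"));
        known.add(addr("127.0.0.2"));

        requirePresent(addr("127.0.0.2"), known); // known endpoint: no exception
        try
        {
            requirePresent(addr("127.0.0.9"), known); // unknown endpoint
        }
        catch (RuntimeException e)
        {
            System.out.println(e.getMessage());
        }
    }
}
```

With a check like this, the operator would see the missing address named directly instead of the generic seed error.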

I've also seen that the node to replace can be the seed selected to gossip 
with, which results in this:
{noformat}
 INFO 14:12:58,298 Gathering node replacement information for /127.0.0.3
 INFO 14:12:58,302 Starting Messaging Service on port 7000
DEBUG 14:12:58,316 attempting to connect to /127.0.0.3
ERROR 14:13:29,320 Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any seeds
at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1123)
at org.apache.cassandra.service.StorageService.prepareReplacementInfo(StorageService.java:396)
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:603)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:584)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:481)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
at 

[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node with the same IP address

2013-10-17 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798461#comment-13798461
 ] 

Tyler Hobbs commented on CASSANDRA-5916:


Minor nitpick: you're missing a space before "because" in:
{noformat}
throw new RuntimeException("Cannot replace_address " + DatabaseDescriptor.getReplaceAddress() + "because it doesn't exist in gossip");
{noformat}

Other than that, +1

 gossip and tokenMetadata get hostId out of sync on failed replace_node with 
 the same IP address
 ---

 Key: CASSANDRA-5916
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5916
 Project: Cassandra
  Issue Type: Bug
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 1.2.12

 Attachments: 5916.txt, 5916-v2.txt, 5916-v3.txt, 5916-v4.txt


 If you try to replace_node an existing, live hostId, it will error out.  
 However, if you're using an existing IP to do this (as in, you chose the wrong 
 uuid to replace by accident), then the newly generated hostId wipes out the 
 old one in TMD, and when you do try to replace it, replace_node will complain 
 it does not exist.  Examination of gossipinfo still shows the old hostId; 
 however, now you can't replace it either.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node with the same IP address

2013-10-08 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789734#comment-13789734
 ] 

Tyler Hobbs commented on CASSANDRA-5916:


I'm testing this out with a three-node ccm cluster.  If I do the following:
# (optional) stop node3
# add a blank node4
# start node4 with replace_address=127.0.0.3

I'll get the following:
{noformat}
ERROR 16:29:02,689 Exception encountered during startup
java.lang.RuntimeException: Cannot replace_address /127.0.0.3because it doesn't exist in gossip
at org.apache.cassandra.service.StorageService.prepareReplacementInfo(StorageService.java:421)
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:623)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:604)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:501)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
java.lang.RuntimeException: Cannot replace_address /127.0.0.3because it doesn't exist in gossip
at org.apache.cassandra.service.StorageService.prepareReplacementInfo(StorageService.java:421)
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:623)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:604)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:501)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
Exception encountered during startup: Cannot replace_address /127.0.0.3because it doesn't exist in gossip
ERROR 16:29:02,692 Exception in thread Thread[StorageServiceShutdownHook,5,main]
java.lang.NullPointerException
at org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
at org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:370)
at org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:569)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:724)
{noformat}

This happens whether node3 is up or down.  It seems like this problem occurs 
any time replace_address doesn't match the broadcast address.

 gossip and tokenMetadata get hostId out of sync on failed replace_node with 
 the same IP address
 ---

 Key: CASSANDRA-5916
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5916
 Project: Cassandra
  Issue Type: Bug
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 1.2.11

 Attachments: 5916.txt, 5916-v2.txt




[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node with the same IP address

2013-10-07 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788523#comment-13788523
 ] 

Brandon Williams commented on CASSANDRA-5916:

First, thanks for testing, [~ravilr]!

bq. does it make sense to allow the operator to specify replace_token with the 
token(s) along with the replace_address to recover

That could work, but I find it a bit ugly and confusing, especially since 
replace_token alone is supposed to work right now, but does not.

bq. I think remaining in shadow mode may not work optimally well for cases 
where the node being replaced was down for more than hint window. So, all the 
nodes would have stopped hinting, and after replace, it would require repair to 
be ran to get the new data fed during the replace.

That is true regardless of shadow mode though, since hibernate is a dead state 
and the node doesn't go live to reset the hint timer until the replace has 
completed.

 gossip and tokenMetadata get hostId out of sync on failed replace_node with 
 the same IP address
 ---

 Key: CASSANDRA-5916
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5916
 Project: Cassandra
  Issue Type: Bug
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 1.2.11

 Attachments: 5916.txt




[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node with the same IP address

2013-10-07 Thread Ravi Prasad (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788570#comment-13788570
 ] 

Ravi Prasad commented on CASSANDRA-5916:


bq. That is true regardless of shadow mode though, since hibernate is a dead state 
and the node doesn't go live to reset the hint timer until the replace has 
completed.

My understanding is that, due to the generation change of the replacing node, 
gossiper.handleMajorStateChange marks the node as dead, since hibernate is one 
of the DEAD_STATES. So the other nodes mark the replacing node as dead before 
the token bootstrap starts, and hence should be storing hints for the replacing 
node from that point.
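
The reasoning above can be pictured with a small model. This is an illustrative sketch only: the `Status` enum, the contents of `DEAD_STATES` beyond hibernate, and both method names are assumptions made for the example, not Cassandra's actual Gossiper code.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative model only; names are hypothetical and merely mirror the
// behaviour described in the comment above.
final class DeadStateModel
{
    enum Status { NORMAL, BOOTSTRAPPING, HIBERNATE, LEFT }

    // Assumed contents: the thread only confirms that hibernate is a dead state.
    static final Set<Status> DEAD_STATES = EnumSet.of(Status.HIBERNATE, Status.LEFT);

    // On a generation change, peers re-evaluate the endpoint from its
    // advertised status and mark it dead if that status is a dead state.
    static boolean markedDeadOnMajorStateChange(Status advertised)
    {
        return DEAD_STATES.contains(advertised);
    }

    // In this model, peers store hints only for endpoints they consider dead.
    static boolean peersStoreHints(Status advertised)
    {
        return markedDeadOnMajorStateChange(advertised);
    }

    public static void main(String[] args)
    {
        System.out.println("hibernate -> hints stored: " + peersStoreHints(Status.HIBERNATE));
        System.out.println("normal    -> hints stored: " + peersStoreHints(Status.NORMAL));
    }
}
```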



[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node with the same IP address

2013-10-07 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788630#comment-13788630
 ] 

Brandon Williams commented on CASSANDRA-5916:

You're right, it will change the endpoint's expire time and reset the window.  
That said, once the bootstrap has started the node should be receiving any 
incoming writes for the range it owns, so 'new' hints shouldn't matter in the 
common case where it succeeds.



[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node with the same IP address

2013-10-07 Thread Ravi Prasad (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788709#comment-13788709
 ] 

Ravi Prasad commented on CASSANDRA-5916:


bq. once the bootstrap has started the node should be receiving any incoming 
writes for the range it owns, so 'new' hints shouldn't matter in the common 
case where it succeeds.

Is this true for a node bootstrapping in the hibernate state? From what I have 
observed, writes to a hibernating node during its bootstrap are not sent to it, 
since gossip marks that node down, right?





[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node

2013-08-22 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747853#comment-13747853
 ] 

Brandon Williams commented on CASSANDRA-5916:

This isn't so much a problem with retrying the replace as it is with using the 
same IP address (which won't work at all currently). The reason is that by 
using the same IP address, the replacing node itself changes the HOST_ID, and 
then can't find the old one.  It's not as simple as just not advertising a new 
HOST_ID, either: by not having one but modifying STATUS, we wipe out any 
existing HOST_ID as well.
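
One way to picture why both variants lose the old HOST_ID is the toy model below. It assumes, which this thread does not spell out, that peers replace an endpoint's whole application-state map when the same IP reappears with new state. `HostIdWipeModel`, `AppState`, and `applyMajorStateChange` are hypothetical names, not Cassandra's API.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model only; names and the replace-whole-map behaviour are assumptions.
final class HostIdWipeModel
{
    enum AppState { STATUS, HOST_ID }

    // Assumption: the incoming state for an IP replaces the stored state
    // wholesale, so keys absent from the incoming map are lost.
    static Map<AppState, String> applyMajorStateChange(Map<AppState, String> current,
                                                       Map<AppState, String> incoming)
    {
        Map<AppState, String> next = new EnumMap<>(AppState.class);
        next.putAll(incoming);
        return next;
    }

    public static void main(String[] args)
    {
        Map<AppState, String> old = new EnumMap<>(AppState.class);
        old.put(AppState.STATUS, "NORMAL");
        old.put(AppState.HOST_ID, "old-uuid");

        // Case 1: the replacing node advertises a freshly generated HOST_ID.
        Map<AppState, String> withNewId = new EnumMap<>(AppState.class);
        withNewId.put(AppState.STATUS, "hibernate");
        withNewId.put(AppState.HOST_ID, "new-uuid");

        // Case 2: the replacing node advertises STATUS only.
        Map<AppState, String> statusOnly = new EnumMap<>(AppState.class);
        statusOnly.put(AppState.STATUS, "hibernate");

        System.out.println(applyMajorStateChange(old, withNewId).get(AppState.HOST_ID));
        System.out.println(applyMajorStateChange(old, statusOnly).get(AppState.HOST_ID));
    }
}
```

In either case the old host ID is no longer reachable through the state stored for that IP, which matches the symptom described above.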

 gossip and tokenMetadata get hostId out of sync on failed replace_node
 --

 Key: CASSANDRA-5916
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5916
 Project: Cassandra
  Issue Type: Bug
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 1.2.9




[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node

2013-08-21 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746595#comment-13746595
 ] 

Brandon Williams commented on CASSANDRA-5916:

The problem runs a little deeper, too: even if you specify the right uuid, if 
the replace fails for whatever reason, gossip and tokenMetadata are out of sync 
again and you can't do the replace at all.



[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node

2013-08-21 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746737#comment-13746737
 ] 

Brandon Williams commented on CASSANDRA-5916:

This same behavior also occurs with replace_token.
