[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node with the same IP address
[ https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798300#comment-13798300 ]

Tyler Hobbs commented on CASSANDRA-5916:

That strategy sounds good to me in principle. I'm seeing a few problems when testing, though.

If I start node4 with replace_address=node3 (while node3 is either up or down), I get an NPE:

{noformat}
DEBUG 14:01:33,359 Node /127.0.0.4 state normal, token [6564349027099416762]
 INFO 14:01:33,362 Node /127.0.0.4 state jump to normal
ERROR 14:01:33,363 Exception encountered during startup
java.lang.NullPointerException
    at org.apache.cassandra.gms.Gossiper.usesHostId(Gossiper.java:682)
    at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:694)
    at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:1382)
    at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1250)
    at org.apache.cassandra.gms.Gossiper.doNotifications(Gossiper.java:973)
    at org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1187)
    at org.apache.cassandra.service.StorageService.setTokens(StorageService.java:214)
    at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:824)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:584)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:481)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
java.lang.NullPointerException
    at org.apache.cassandra.gms.Gossiper.usesHostId(Gossiper.java:682)
    at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:694)
    at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:1382)
    at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1250)
    at org.apache.cassandra.gms.Gossiper.doNotifications(Gossiper.java:973)
    at org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1187)
    at org.apache.cassandra.service.StorageService.setTokens(StorageService.java:214)
    at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:824)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:584)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:481)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
Exception encountered during startup: null
ERROR 14:01:33,368 Exception in thread Thread[StorageServiceShutdownHook,5,main]
java.lang.NullPointerException
    at org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
    at org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:370)
    at org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
    at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:549)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at java.lang.Thread.run(Thread.java:724)
{noformat}

If I do replace_address with a non-existent node, after the ring delay sleep, I'll see:

{noformat}
java.lang.RuntimeException: Unable to gossip with any seeds
{noformat}

which is misleading, as that's not the actual problem. Perhaps we should explicitly check for presence of the address to replace?

I've also seen that the node to replace can be the seed selected to gossip with, which results in this:

{noformat}
 INFO 14:12:58,298 Gathering node replacement information for /127.0.0.3
 INFO 14:12:58,302 Starting Messaging Service on port 7000
DEBUG 14:12:58,316 attempting to connect to /127.0.0.3
ERROR 14:13:29,320 Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any seeds
    at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1123)
    at org.apache.cassandra.service.StorageService.prepareReplacementInfo(StorageService.java:396)
    at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:603)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:584)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:481)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
    at
[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node with the same IP address
[ https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798461#comment-13798461 ]

Tyler Hobbs commented on CASSANDRA-5916:

Minor nitpick: you're missing a space before "because" in:

{noformat}
throw new RuntimeException("Cannot replace_address " + DatabaseDescriptor.getReplaceAddress() + "because it doesn't exist in gossip");
{noformat}

Other than that, +1

gossip and tokenMetadata get hostId out of sync on failed replace_node with the same IP address
---

Key: CASSANDRA-5916
URL: https://issues.apache.org/jira/browse/CASSANDRA-5916
Project: Cassandra
Issue Type: Bug
Reporter: Brandon Williams
Assignee: Brandon Williams
Fix For: 1.2.12
Attachments: 5916.txt, 5916-v2.txt, 5916-v3.txt, 5916-v4.txt

If you try to replace_node an existing, live hostId, it will error out. However if you're using an existing IP to do this (as in, you chose the wrong uuid to replace on accident) then the newly generated hostId wipes out the old one in TMD, and when you do try to replace it replace_node will complain it does not exist. Examination of gossipinfo still shows the old hostId, however now you can't replace it either.

--
This message was sent by Atlassian JIRA (v6.1#6144)
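As an illustration of the nitpick above, here is a minimal Java sketch of a presence check whose error message has a space on both sides of the address. The class `ReplaceAddressCheck`, its method names, and the use of plain strings for endpoints are assumptions made for this example; it is not the code from the attached patches.

```java
import java.util.Set;

/** Hypothetical helper; names are illustrative and not taken from the Cassandra patch. */
final class ReplaceAddressCheck {
    private ReplaceAddressCheck() {}

    /** Builds the error text with a space on both sides of the address. */
    static String missingFromGossipMessage(String replaceAddress) {
        return "Cannot replace_address " + replaceAddress
                + " because it doesn't exist in gossip";
    }

    /** Fails fast when the address to replace is not among the endpoints known to gossip. */
    static void requireKnownInGossip(String replaceAddress, Set<String> knownEndpoints) {
        if (!knownEndpoints.contains(replaceAddress)) {
            throw new RuntimeException(missingFromGossipMessage(replaceAddress));
        }
    }
}
```

Keeping the message in one small method makes the spacing easy to check in isolation from the startup path.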
[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node with the same IP address
[ https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789734#comment-13789734 ]

Tyler Hobbs commented on CASSANDRA-5916:

I'm testing this out with a three-node ccm cluster. If I do the following:

# (optional) stop node3
# add a blank node4
# start node4 with replace_address=127.0.0.3

I'll get the following:

{noformat}
ERROR 16:29:02,689 Exception encountered during startup
java.lang.RuntimeException: Cannot replace_address /127.0.0.3because it doesn't exist in gossip
    at org.apache.cassandra.service.StorageService.prepareReplacementInfo(StorageService.java:421)
    at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:623)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:604)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:501)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
java.lang.RuntimeException: Cannot replace_address /127.0.0.3because it doesn't exist in gossip
    at org.apache.cassandra.service.StorageService.prepareReplacementInfo(StorageService.java:421)
    at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:623)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:604)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:501)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
Exception encountered during startup: Cannot replace_address /127.0.0.3because it doesn't exist in gossip
ERROR 16:29:02,692 Exception in thread Thread[StorageServiceShutdownHook,5,main]
java.lang.NullPointerException
    at org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
    at org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:370)
    at org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
    at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:569)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at java.lang.Thread.run(Thread.java:724)
{noformat}

This happens whether node3 is up or down. It seems like this problem occurs any time replace_address doesn't match the broadcast address.

Key: CASSANDRA-5916
Fix For: 1.2.11
Attachments: 5916.txt, 5916-v2.txt
[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node with the same IP address
[ https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788523#comment-13788523 ]

Brandon Williams commented on CASSANDRA-5916:

First, thanks for testing, [~ravilr]!

bq. does it make sense to allow the operator to specify replace_token with the token(s) along with the replace_address to recover

That could work, but I find it a bit ugly and confusing, especially since replace_token alone is supposed to work right now, but does not.

bq. I think remaining in shadow mode may not work optimally well for cases where the node being replaced was down for more than the hint window. So, all the nodes would have stopped hinting, and after replace, it would require repair to be run to get the new data fed during the replace.

That is true regardless of shadow mode though, since hibernate is a dead state and the node doesn't go live to reset the hint timer until the replace has completed.

Key: CASSANDRA-5916
Fix For: 1.2.11
Attachments: 5916.txt
[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node with the same IP address
[ https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788570#comment-13788570 ]

Ravi Prasad commented on CASSANDRA-5916:

bq. That is true regardless of shadow mode though, since hibernate is a dead state and the node doesn't go live to reset the hint timer until the replace has completed.

My understanding is that, due to the generation change of the replacing node, gossiper.handleMajorStateChange marks the node as dead, as hibernate is one of the DEAD_STATES. So the other nodes mark the replacing node as dead before the token bootstrap starts, and hence should be storing hints for the replacing node from that point.
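Ravi's reading of the dead-state handling can be pictured with a small Java sketch. This is a toy model, not Cassandra's Gossiper: the class name, the string-based statuses, and every entry in the set other than "hibernate" are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy model of the liveness decision described above; not Cassandra's Gossiper. */
final class DeadStateSketch {
    private DeadStateSketch() {}

    // "hibernate" comes from the discussion; the other entries are assumed placeholders.
    static final Set<String> DEAD_STATES =
            new HashSet<>(Arrays.asList("hibernate", "removing", "removed", "LEFT"));

    /** Returns true when a STATUS value should cause peers to treat the node as dead. */
    static boolean isDeadState(String status) {
        return DEAD_STATES.contains(status);
    }

    /** On a generation change, a peer marks the node UP unless its status is a dead state. */
    static String onMajorStateChange(String status) {
        return isDeadState(status) ? "DOWN" : "UP";
    }
}
```

Under this model a replacing node that announces itself in hibernate is marked DOWN by its peers as soon as they see the new generation, which is the point from which Ravi expects hints to be stored.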
[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node with the same IP address
[ https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788630#comment-13788630 ]

Brandon Williams commented on CASSANDRA-5916:

You're right, it will change the endpoint's expire time and reset the window. That said, once the bootstrap has started the node should be receiving any incoming writes for the range it owns, so 'new' hints shouldn't matter in the common case where it succeeds.
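The window reset discussed in this exchange can be sketched as follows. This is a simplified Java model under stated assumptions: a coordinator stores hints only while the target has been down for less than a fixed window, and a reset is modelled as recording a fresh "down since" timestamp. The class name, method name, and the three-hour value are illustrative, not taken from the Cassandra source.

```java
/** Toy model of a hint window; names and the 3-hour value are assumptions. */
final class HintWindowSketch {
    private HintWindowSketch() {}

    // Assumed window length for the example.
    static final long ASSUMED_WINDOW_MILLIS = 3L * 60 * 60 * 1000;

    /** Hints are stored only while the target has been down for less than the window. */
    static boolean shouldStoreHint(long downSinceMillis, long nowMillis, long windowMillis) {
        return nowMillis - downSinceMillis < windowMillis;
    }
}
```

With these assumptions, a node down for four hours is past the window and receives no hints. If its peers then record a new down time because the replacing node announced a new generation, calls made shortly afterwards fall inside the window again.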
[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node with the same IP address
[ https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788709#comment-13788709 ]

Ravi Prasad commented on CASSANDRA-5916:

bq. once the bootstrap has started the node should be receiving any incoming writes for the range it owns, so 'new' hints shouldn't matter in the common case where it succeeds.

Is this true for a node bootstrapping in hibernate state? From what I have observed, writes to a hibernating node during its bootstrap are not sent to it, as gossip marks that node down, right?
[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node
[ https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747853#comment-13747853 ]

Brandon Williams commented on CASSANDRA-5916:

This isn't so much a problem with retrying the replace as it is with using the same IP address (which won't work at all currently). The reason for this is that by using the same IP address, the replacing node itself changes the HOST_ID, and then can't find the old one. It's not as simple as not advertising a new HOST_ID either: by not having one but modifying STATUS, we wipe out any existing HOST_ID as well.

Key: CASSANDRA-5916
Fix For: 1.2.9

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
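Brandon's explanation can be pictured with a toy Java model. It assumes that gossip state is keyed by IP address and that a new announcement from an address replaces that address's previous state as a whole; the class `GossipStateSketch` and its methods are hypothetical and do not mirror Cassandra's actual data structures.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy gossip table keyed by IP address; an illustration, not Cassandra's data model. */
final class GossipStateSketch {
    // endpoint IP -> application states such as STATUS and HOST_ID
    private final Map<String, Map<String, String>> states = new HashMap<>();

    /** Assumption: a newer announcement from an endpoint replaces its whole state map. */
    void announce(String ip, Map<String, String> newState) {
        states.put(ip, new HashMap<>(newState));
    }

    /** Returns the advertised HOST_ID for the endpoint, or null if none is present. */
    String hostId(String ip) {
        Map<String, String> state = states.get(ip);
        return state == null ? null : state.get("HOST_ID");
    }
}
```

In this model, if 127.0.0.3 first announces STATUS and HOST_ID, and a replacing node on the same IP then announces only STATUS, `hostId("127.0.0.3")` returns null afterwards. That matches the observation that leaving out HOST_ID is not enough to preserve the old one.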
[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node
[ https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746595#comment-13746595 ]

Brandon Williams commented on CASSANDRA-5916:

The problem runs a little deeper, too: even if you specify the right uuid, and the replace fails for whatever reason, now they're out of sync again and you can't do the replace at all.
[jira] [Commented] (CASSANDRA-5916) gossip and tokenMetadata get hostId out of sync on failed replace_node
[ https://issues.apache.org/jira/browse/CASSANDRA-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746737#comment-13746737 ]

Brandon Williams commented on CASSANDRA-5916:

This same behavior also occurs with replace_token.