[jira] [Commented] (CASSANDRA-8138) replace_address cannot find node to be replaced node after seed node restart

2014-11-06 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200541#comment-14200541
 ] 

Brandon Williams commented on CASSANDRA-8138:
-

I think I'd much rather say that the edge case of a node dying, and then a full 
cluster restart (rolling would still work) is just not supported, rather than 
make such invasive changes to support replacement under such strange and rare 
conditions.  If that happens, it's time to assassinate the node and bootstrap 
another one.

> replace_address cannot find node to be replaced node after seed node restart
> 
>
> Key: CASSANDRA-8138
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8138
> Project: Cassandra
> Issue Type: Bug
> Reporter: Oleg Anastasyev
> Assignee: Brandon Williams
> Attachments: ReplaceAfterSeedRestart.txt
>
>
> If a node failed and a cluster was restarted (which is common case on massive 
> outages), replace_address fails with
> {code}
> Caused by: java.lang.RuntimeException: Cannot replace_address /172.19.56.97 because it doesn't exist in gossip
> jvm 1|at org.apache.cassandra.service.StorageService.prepareReplacementInfo(StorageService.java:472)
> jvm 1|at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:724)
> jvm 1|at org.apache.cassandra.service.StorageService.initServer(StorageService.java:686)
> jvm 1|at org.apache.cassandra.service.StorageService.initServer(StorageService.java:562)
> {code}
> Although the necessary information is saved in system tables on seed nodes, it 
> is not loaded into gossip on the seed node, so a replacement node cannot get this 
> info.
> The attached patch loads all information from system tables into gossip with 
> generation 0 and fixes some bugs around this info in the shadow gossip round.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8138) replace_address cannot find node to be replaced node after seed node restart

2014-10-21 Thread Oleg Anastasyev (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178512#comment-14178512
 ] 

Oleg Anastasyev commented on CASSANDRA-8138:


This happens because the dead node's tokens, host ID and DC:RACK are loaded from 
the system tables only into TokenMetadata on startup, not into gossip state. The 
loading code only calls Gossip.addSavedEndpoint(InetAddr), which adds just the 
inet address of the dead node with generation 0.
If the dead node has not participated in gossip since the restart, its 
EndpointState contains no TOKENS, HOST_ID, etc. application states. 
But replace_address uses the gossip shadow round to obtain the necessary 
information about the dead node so that it can replace it, and all it can get 
from gossip is the node's inet address. There is also a bug in 
Gossip.examineGossiper that prevents even this from being sent to a replacing 
node, so in fact the replacing node gets no information about the dead node at 
all, as if it had never existed. 
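
To make the failure mode above concrete, here is a minimal, self-contained sketch. 
It is a toy model, not Cassandra code: ShadowRoundSketch, SeedGossipState, 
EndpointState, AppState and tokensForReplacement are all hypothetical names 
invented for illustration, and the two-argument addSavedEndpoint only imitates 
what the issue description says the attached patch does (loading the saved 
information into gossip with generation 0).

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Toy model only: none of these types are Cassandra's real classes.
public class ShadowRoundSketch {
    enum AppState { TOKENS, HOST_ID, DC, RACK }

    /** Hypothetical stand-in for a gossip endpoint state. */
    static final class EndpointState {
        final int generation;
        final Map<AppState, String> appStates = new EnumMap<>(AppState.class);

        EndpointState(int generation) {
            this.generation = generation;
        }
    }

    /** Hypothetical stand-in for what a restarted seed holds in gossip. */
    static final class SeedGossipState {
        final Map<String, EndpointState> endpoints = new HashMap<>();

        // Behaviour described in the comment: only the address is
        // registered, with generation 0 and no application states.
        void addSavedEndpoint(String address) {
            endpoints.put(address, new EndpointState(0));
        }

        // Assumed shape of the proposed fix: also copy the application
        // states saved in the system tables, still at generation 0.
        void addSavedEndpoint(String address, Map<AppState, String> saved) {
            EndpointState state = new EndpointState(0);
            state.appStates.putAll(saved);
            endpoints.put(address, state);
        }
    }

    /** What a replacing node could learn about the dead node's tokens. */
    static Optional<String> tokensForReplacement(SeedGossipState seed, String address) {
        EndpointState state = seed.endpoints.get(address);
        if (state == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(state.appStates.get(AppState.TOKENS));
    }

    public static void main(String[] args) {
        String dead = "172.19.56.97";

        SeedGossipState addressOnly = new SeedGossipState();
        addressOnly.addSavedEndpoint(dead);
        System.out.println("address only, tokens known: "
                + tokensForReplacement(addressOnly, dead).isPresent());

        SeedGossipState withStates = new SeedGossipState();
        withStates.addSavedEndpoint(dead, Map.of(
                AppState.TOKENS, "example-token",   // placeholder value
                AppState.HOST_ID, "example-host-id" // placeholder value
        ));
        System.out.println("with saved states, tokens known: "
                + tokensForReplacement(withStates, dead).isPresent());
    }
}
```

In the address-only case the replacing node has nothing to work with except the 
address itself, which matches the symptom reported in this issue.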

I believe the same would apply to a bootstrapping node if there was a full 
cluster restart after some node died and a new node is then added to the 
cluster. That would lead to wrong token metadata on the freshly bootstrapped 
node (I have not tested this case, though).



[jira] [Commented] (CASSANDRA-8138) replace_address cannot find node to be replaced node after seed node restart

2014-10-21 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178477#comment-14178477
 ] 

Brandon Williams commented on CASSANDRA-8138:
-

Can you explain why the dead node wasn't loaded from the system table?



[jira] [Commented] (CASSANDRA-8138) replace_address cannot find node to be replaced node after seed node restart

2014-10-20 Thread Oleg Anastasyev (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177401#comment-14177401
 ] 

Oleg Anastasyev commented on CASSANDRA-8138:


Ah, sorry, the description was misleading. 

The whole cluster was restarted, so the failed node never participated in gossip 
since cluster startup; it was not a restart of just one of the seeds. I have 
fixed the description.



[jira] [Commented] (CASSANDRA-8138) replace_address cannot find node to be replaced node after seed node restart

2014-10-20 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177273#comment-14177273
 ] 

Brandon Williams commented on CASSANDRA-8138:
-

bq. If a node failed and a cluster (or one of seeds) was restarted (which is 
common case on massive outages), replace_address fails

This confuses me, since if the node was in gossip at all, the seed should get 
repopulated with it.  Can you explain the exact scenario here?
