[jira] [Commented] (SOLR-6923) kill -9 doesn't change the replica state in clusterstate.json
[ https://issues.apache.org/jira/browse/SOLR-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272953#comment-14272953 ]

Varun Thacker commented on SOLR-6923:
-------------------------------------

Thanks Tim for pointing it out. I was not aware of this. I'll rename the issue appropriately with this information and come up with a patch for AutoAddReplicas to consult live nodes too.

kill -9 doesn't change the replica state in clusterstate.json
-------------------------------------------------------------

        Key: SOLR-6923
        URL: https://issues.apache.org/jira/browse/SOLR-6923
    Project: Solr
 Issue Type: Bug
   Reporter: Varun Thacker

I did the following
{code}
./solr start -e cloud -noprompt
kill -9 pid-of-node2  // Not the node which is running ZK
{code}

- /live_nodes reflects that the node is gone.
- This is the only message which gets logged on the node1 server after killing node2:
{code}
45812 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:9983] WARN org.apache.zookeeper.server.NIOServerCnxn - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x14ac40f26660001, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:745)
{code}
- The graph shows node2 in the 'Gone' state.
- clusterstate.json keeps showing the replica as 'active':
{code}
{"collection1":{
    "shards":{"shard1":{
        "range":"8000-7fff",
        "state":"active",
        "replicas":{
          "core_node1":{
            "state":"active",
            "core":"collection1",
            "node_name":"169.254.113.194:8983_solr",
            "base_url":"http://169.254.113.194:8983/solr",
            "leader":"true"},
          "core_node2":{
            "state":"active",
            "core":"collection1",
            "node_name":"169.254.113.194:8984_solr",
            "base_url":"http://169.254.113.194:8984/solr"}}}},
    "maxShardsPerNode":"1",
    "router":{"name":"compositeId"},
    "replicationFactor":"1",
    "autoAddReplicas":"false",
    "autoCreated":"true"}}
{code}
One immediate problem I can see is that AutoAddReplicas doesn't work, since clusterstate.json never changes.
There might be more features affected by this. On first thought, I think we can handle it like this: the shard leader could listen for changes on /live_nodes and, if it has replicas that were on a departed node, mark them as 'down' in clusterstate.json?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
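The proposal above can be sketched as plain Java, independent of Solr's actual classes (all names here are illustrative, not Solr's API): on a /live_nodes change the leader diffs the old and new live sets, then marks any replica hosted on a departed node as "down".

```java
import java.util.*;

// Hypothetical sketch of the leader-side reaction to a /live_nodes change.
// Names and types are illustrative stand-ins, not Solr's real API.
public class LiveNodesWatcherSketch {

    // Nodes that were present before the change but are absent after it.
    static Set<String> nodesThatLeft(Set<String> before, Set<String> after) {
        Set<String> departed = new HashSet<>(before);
        departed.removeAll(after);
        return departed;
    }

    // Returns updated replica states: any replica whose host node departed
    // is marked "down"; everything else keeps its published state.
    static Map<String, String> markDownReplicas(Map<String, String> replicaNode,
                                                Map<String, String> states,
                                                Set<String> departed) {
        Map<String, String> updated = new HashMap<>(states);
        for (Map.Entry<String, String> e : replicaNode.entrySet()) {
            if (departed.contains(e.getValue())) {
                updated.put(e.getKey(), "down");
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        // replica name -> node hosting it, as in the clusterstate.json above
        Map<String, String> replicaNode = Map.of(
                "core_node1", "169.254.113.194:8983_solr",
                "core_node2", "169.254.113.194:8984_solr");
        Map<String, String> states = Map.of(
                "core_node1", "active",
                "core_node2", "active");

        // node2 is killed with kill -9: its ephemeral /live_nodes entry vanishes
        Set<String> before = Set.of("169.254.113.194:8983_solr",
                                    "169.254.113.194:8984_solr");
        Set<String> after = Set.of("169.254.113.194:8983_solr");

        Map<String, String> updated = markDownReplicas(replicaNode, states,
                nodesThatLeft(before, after));
        System.out.println(updated.get("core_node2")); // prints "down"
    }
}
```

The diff-based approach matters because a kill -9 produces no clean shutdown message; the only signal is the ephemeral znode expiring, so the leader must infer the state change itself.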
[jira] [Commented] (SOLR-6923) kill -9 doesn't change the replica state in clusterstate.json
[ https://issues.apache.org/jira/browse/SOLR-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268278#comment-14268278 ]

Timothy Potter commented on SOLR-6923:
--------------------------------------

The actual runtime state of a replica is determined by 1) what's in clusterstate.json, and 2) whether the node hosting the replica is live. If the node is not live, the state reported in clusterstate.json can be stale for some time. It has always worked this way in SolrCloud. Thus, AutoAddReplicas needs to consult live nodes before assuming a replica is live.
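The rule Tim describes can be sketched as a single pure function: a replica's effective state is whatever clusterstate.json publishes, gated by membership of its node in /live_nodes. This is a simplified stand-in, not Solr's actual ClusterState class.

```java
import java.util.*;

// Sketch of combining clusterstate.json with /live_nodes: a stale "active"
// entry for a node that is no longer live must be treated as down.
// Simplified stand-in types; not Solr's real API.
public class EffectiveStateSketch {

    static String effectiveState(String publishedState, String nodeName,
                                 Set<String> liveNodes) {
        return liveNodes.contains(nodeName) ? publishedState : "down";
    }

    public static void main(String[] args) {
        Set<String> liveNodes = Set.of("169.254.113.194:8983_solr");

        // clusterstate.json still says "active" for the killed node2,
        // but node2 is absent from /live_nodes
        System.out.println(effectiveState("active",
                "169.254.113.194:8984_solr", liveNodes)); // prints "down"
    }
}
```

This is the check AutoAddReplicas would need to perform before trusting the published state: consult both sources, never clusterstate.json alone.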