[ https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379977#comment-16379977 ]
Cao Manh Dat edited comment on SOLR-12011 at 2/28/18 9:10 AM:
--------------------------------------------------------------

Thanks [~shalinmangar]

1, 3, 6 are correct.
2. Yeah, it seems that in the {{if(isClosed)}} case we can simply return.
4. Can't be: in {{ZkController.register()}} we register the replica's term first, then join the election after that. Besides that, the last published state is initialized to ACTIVE, so when a core is first loaded on startup of a node, the flag is useless.
5. Should we? A replica only sends PrepRecoveryOp to the leader after it has successfully updated its term, so I think a live fetch on the leader's side will be enough. And I'm afraid that looping at that call can cause an endless loop. (I'm not sure about this point.)

was (Author: caomanhdat):
Thanks [~shalinmangar]

1, 3, 6 are correct.
2. Yeah, it seems that in the {{if(isClosed)}} case we can simply return.
4. Can't be: in {{ZkController.register()}} we register the replica's term first, then join the election after that. Besides that, the last published state is initialized to ACTIVE, so when a core is first loaded on startup of a node, the flag is useless.
5. Should we? A replica only sends PrepRecoveryOp to the leader after it has successfully updated its term, so I think a live fetch on the leader's side will be enough. And I'm afraid that looping at that call can cause a race condition. (I'm not sure about this point.)

> Consistence problem when in-sync replicas are DOWN
> --------------------------------------------------
>
>                 Key: SOLR-12011
>                 URL: https://issues.apache.org/jira/browse/SOLR-12011
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: SolrCloud
>            Reporter: Cao Manh Dat
>            Assignee: Cao Manh Dat
>            Priority: Major
>         Attachments: SOLR-12011.patch
>
>
> Currently, we run into a consistency problem when in-sync replicas are DOWN.
> For example:
> 1. A collection has 1 shard with 1 leader and 2 replicas
> 2. The nodes containing the 2 replicas go down
> 3. The leader receives an update A, which succeeds
> 4. The node containing the leader goes down
> 5. The 2 replicas come back
> 6. One of them becomes leader --> but neither should become leader, since they
> missed update A
> A solution to this issue:
> * The idea here is that the term value of each replica (SOLR-11702) is
> enough to tell whether a replica has received the latest updates. Therefore
> only replicas with the highest term can become the leader.
> * There are a couple of things that need to be done for this issue:
> ** When the leader receives its first updates, its term should be changed from 0
> -> 1, so further replicas added to the same shard won't be able to become
> leader (their term = 0) until they finish recovery
> ** For DOWN replicas, the leader also needs to check (in DUP.finish())
> that those replicas have a term lower than the leader's before returning results to users
> ** Looking at the term value of a replica alone is not enough to tell us
> whether that replica is in sync with the leader, because the replica might not
> have finished the recovery process. We need to introduce another flag (stored on
> the shard term node in ZK) to tell us whether the replica has finished recovery. It
> will look like this:
> *** {"core_node1" : 1, "core_node2" : 0} (when core_node2 starts recovery)
> --->
> *** {"core_node1" : 1, "core_node2" : 1, "core_node2_recovering" : 1}
> (when core_node2 finishes recovery) --->
> *** {"core_node1" : 1, "core_node2" : 1}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
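
The term bookkeeping proposed in the issue description above can be modeled in a few lines. This is only an illustrative Python sketch, not Solr code: the function names (`eligible_leaders`, `start_recovery`, `finish_recovery`) are hypothetical, and the shard term node in ZK is represented as a plain dict shaped like the JSON in the description. It assumes a replica may become leader only if it holds the highest term and has no `_recovering` flag set.

```python
from typing import Dict, List

# Suffix of the flag key, mirroring "core_node2_recovering" in the description.
RECOVERING_SUFFIX = "_recovering"


def eligible_leaders(terms: Dict[str, int]) -> List[str]:
    """Return replicas that hold the highest term and are not flagged as recovering."""
    # Separate real replica entries from "<replica>_recovering" flag entries.
    replicas = {k: v for k, v in terms.items() if not k.endswith(RECOVERING_SUFFIX)}
    if not replicas:
        return []
    highest = max(replicas.values())
    return sorted(
        name
        for name, term in replicas.items()
        if term == highest and name + RECOVERING_SUFFIX not in terms
    )


def start_recovery(terms: Dict[str, int], replica: str, leader: str) -> Dict[str, int]:
    """Hypothetical helper: the replica takes the leader's term and sets its flag."""
    updated = dict(terms)
    updated[replica] = terms[leader]
    updated[replica + RECOVERING_SUFFIX] = 1
    return updated


def finish_recovery(terms: Dict[str, int], replica: str) -> Dict[str, int]:
    """Hypothetical helper: clear the flag once recovery has completed."""
    updated = dict(terms)
    updated.pop(replica + RECOVERING_SUFFIX, None)
    return updated
```

Walking the three states from the description through these helpers: `core_node2` has an equal term while it is still recovering, yet it only becomes eligible once its flag is removed.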