[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16384312#comment-16384312
 ] 

Cao Manh Dat edited comment on SOLR-12011 at 3/2/18 11:43 PM:
--------------------------------------------------------------

[~elyograg] Yeah, the shard just waits, with no leader at all, until the replica 
that got the update comes back OR users call the FORCE_LEADER API (if it never 
comes back). My idea for this problem (in a different ticket) is to increase the 
default leaderVoteWait to 1 hour; after that timeout, replicas just go ahead and 
become leader.



> Consistence problem when in-sync replicas are DOWN
> --------------------------------------------------
>
>                 Key: SOLR-12011
>                 URL: https://issues.apache.org/jira/browse/SOLR-12011
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Cao Manh Dat
>            Assignee: Cao Manh Dat
>            Priority: Major
>         Attachments: SOLR-12011.patch, SOLR-12011.patch, SOLR-12011.patch, 
> SOLR-12011.patch, SOLR-12011.patch
>
>
> Currently, we run into a consistency problem when in-sync replicas are DOWN. 
> For example:
>  1. A collection has 1 shard with 1 leader and 2 replicas
>  2. The nodes containing the 2 replicas go down
>  3. The leader receives an update A, which succeeds
>  4. The node containing the leader goes down
>  5. The 2 replicas come back
>  6. One of them becomes leader --> But neither should become leader, since 
> they missed update A
> A solution to this issue:
>  * The idea here is that the term value of each replica (SOLR-11702) is 
> enough to tell whether a replica received the latest updates. Therefore 
> only replicas with the highest term can become the leader.
>  * A couple of things need to be done for this issue:
>  ** When the leader receives its first update, its term should change from 0 
> -> 1, so replicas added to the same shard later won't be able to become 
> leader (their term = 0) until they finish recovery
>  ** For DOWN replicas, the leader also needs to check (in DUP.finish()) that 
> those replicas have a term lower than the leader's before returning results 
> to users
>  ** The term value of a replica alone is not enough to tell us whether that 
> replica is in sync with the leader, because the replica might not have 
> finished the recovery process. We need to introduce another flag (stored on 
> the shard term node in ZK) to tell us whether a replica has finished 
> recovery. It will look like this:
>  *** {"core_node1" : 1, "core_node2" : 0} (when core_node2 starts recovery) 
> --->
>  *** {"core_node1" : 1, "core_node2" : 1, "core_node2_recovering" : 1} 
> (when core_node2 finishes recovery) --->
>  *** {"core_node1" : 1, "core_node2" : 1}
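
The rule described above (only replicas with the highest term, and no 
recovering flag, may become leader) can be sketched roughly as follows. This is 
an illustrative, in-memory sketch only, not the code in SOLR-12011.patch: the 
class ShardTermsSketch and all of its method names are hypothetical, and the 
real terms live on a ZK node rather than in a local map. The "_recovering" key 
suffix mirrors the JSON shown in the description.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the per-shard term node stored in ZK.
final class ShardTermsSketch {
    private static final String RECOVERING = "_recovering";
    private final Map<String, Long> terms = new HashMap<>();

    // New replicas start at term 0.
    void register(String replica) {
        terms.putIfAbsent(replica, 0L);
    }

    // Highest term among real replicas (recovering flags are skipped).
    long maxTerm() {
        long max = 0L;
        for (Map.Entry<String, Long> e : terms.entrySet()) {
            if (!e.getKey().endsWith(RECOVERING)) {
                max = Math.max(max, e.getValue());
            }
        }
        return max;
    }

    // Called once the leader has applied an update; 'acked' holds the
    // replicas that also received it. The leader's term is bumped on the
    // first update, or when a replica that missed the update still has a
    // term >= the leader's.
    void recordUpdate(String leader, Set<String> acked) {
        long leaderTerm = terms.getOrDefault(leader, 0L);
        boolean someoneMissed = false;
        for (Map.Entry<String, Long> e : terms.entrySet()) {
            String name = e.getKey();
            if (name.endsWith(RECOVERING) || name.equals(leader) || acked.contains(name)) {
                continue;
            }
            if (e.getValue() >= leaderTerm) {
                someoneMissed = true;
            }
        }
        if (leaderTerm == 0L || someoneMissed) {
            long next = leaderTerm + 1;
            terms.put(leader, next);
            for (String r : acked) {
                terms.put(r, next);
            }
        }
    }

    // Recovery start: take the highest term, but mark the replica as recovering.
    void startRecovery(String replica) {
        terms.put(replica, maxTerm());
        terms.put(replica + RECOVERING, 1L);
    }

    // Recovery finish: drop the flag, the replica is now in sync.
    void finishRecovery(String replica) {
        terms.remove(replica + RECOVERING);
    }

    // Eligible only with the highest term and no recovering flag.
    boolean canBecomeLeader(String replica) {
        Long term = terms.get(replica);
        return term != null
                && term == maxTerm()
                && !terms.containsKey(replica + RECOVERING);
    }

    public static void main(String[] args) {
        ShardTermsSketch shard = new ShardTermsSketch();
        shard.register("core_node1");
        shard.register("core_node2");
        // core_node2 is down, so only the leader sees update A.
        shard.recordUpdate("core_node1", Set.of());
        System.out.println(shard.canBecomeLeader("core_node2"));
        shard.startRecovery("core_node2");
        shard.finishRecovery("core_node2");
        System.out.println(shard.canBecomeLeader("core_node2"));
    }
}
```

In the scenario from the description, the replicas that missed update A keep 
term 0 while the leader moves to 1, so neither of them is eligible when it 
comes back until it has gone through recovery.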


