[jira] [Comment Edited] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

Cao Manh Dat (JIRA) Wed, 28 Feb 2018 01:11:48 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379977#comment-16379977
 ]


Cao Manh Dat edited comment on SOLR-12011 at 2/28/18 9:10 AM:
--------------------------------------------------------------

Thanks [~shalinmangar]

1, 3, 6 are correct

2, yeah, it seems that in the {{if (isClosed)}} case we can simply return

4, Can't be: in {{ZkController.register()}} we register the replica's term 
first and only join the election after that. Besides, the last published state 
is initialized to ACTIVE, so when a core is first loaded on startup of a node 
the flag is useless

5. Should we? A replica only sends PrepRecoveryOp to the leader after it has 
successfully updated its term, so I think a live-fetch on the leader's side 
will be enough. I'm also afraid that looping at that call can cause an endless 
loop. (I'm not sure about this point.)

 

 


was (Author: caomanhdat):
Thanks [~shalinmangar]

1, 3, 6 are correct

2, yeah, it seems that in the {{if (isClosed)}} case we can simply return

4, Can't be: in {{ZkController.register()}} we register the replica's term 
first and only join the election after that. Besides, the last published state 
is initialized to ACTIVE, so when a core is first loaded on startup of a node 
the flag is useless

5. Should we? A replica only sends PrepRecoveryOp to the leader after it has 
successfully updated its term, so I think a live-fetch on the leader's side 
will be enough. I'm also afraid that looping at that call can cause a race 
condition. (I'm not sure about this point.)

 

 

> Consistence problem when in-sync replicas are DOWN
> --------------------------------------------------
>
>                 Key: SOLR-12011
>                 URL: https://issues.apache.org/jira/browse/SOLR-12011
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Cao Manh Dat
>            Assignee: Cao Manh Dat
>            Priority: Major
>         Attachments: SOLR-12011.patch
>
>
> Currently, we run into a consistency problem when in-sync replicas are DOWN. 
> For example:
>  1. A collection has 1 shard with 1 leader and 2 replicas
>  2. The nodes containing the 2 replicas go down
>  3. The leader receives an update A, which succeeds
>  4. The node containing the leader goes down
>  5. The 2 replicas come back
>  6. One of them becomes leader --> But neither should become leader, since 
> they missed update A
> A solution to this issue:
>  * The idea here is that the term value of each replica (SOLR-11702) is 
> enough to tell whether a replica received the latest updates. Therefore 
> only replicas with the highest term can become the leader.
>  * A couple of things need to be done for this issue
>  ** When the leader receives the first updates, its term should change from 
> 0 -> 1, so replicas added to the same shard later won't be able to become 
> leader (their term = 0) until they finish recovery
>  ** For DOWN replicas, the leader also needs to check (in DUP.finish()) 
> that those replicas have a term lower than the leader's before returning 
> results to users
>  ** The term value of a replica alone is not enough to tell us whether that 
> replica is in sync with the leader, because the replica might not have 
> finished the recovery process. We need to introduce another flag (stored on 
> the shard term node in ZK) to tell us whether the replica has finished 
> recovery. It will look like this.
>  *** {"core_node1" : 1, "core_node2" : 0} (when core_node2 starts recovery) 
> --->
>  *** {"core_node1" : 1, "core_node2" : 1, "core_node2_recovering" : 1} 
> (when core_node2 finishes recovery) --->
>  *** {"core_node1" : 1, "core_node2" : 1}
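
The term bookkeeping described in the issue can be sketched roughly as follows. 
This is a minimal, illustrative model in plain Java, not Solr's actual 
implementation: the class ShardTermsSketch and all of its method names are 
hypothetical, the terms live in an in-memory map rather than on the shard term 
node in ZK, and storing the current highest term under the "_recovering" key is 
an assumption taken from the example values in the description.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory model of per-shard replica terms (not Solr code). */
final class ShardTermsSketch {
    // Assumed key suffix, following the "core_node2_recovering" example.
    static final String RECOVERING_SUFFIX = "_recovering";

    private final Map<String, Long> terms = new HashMap<>();

    /** Registers a replica with term 0 if it is not known yet. */
    void register(String replica) {
        terms.putIfAbsent(replica, 0L);
    }

    /** Highest term among real replicas (recovering flags are ignored). */
    long maxTerm() {
        long max = 0L;
        for (Map.Entry<String, Long> e : terms.entrySet()) {
            if (!e.getKey().endsWith(RECOVERING_SUFFIX)) {
                max = Math.max(max, e.getValue());
            }
        }
        return max;
    }

    /**
     * Leader side: if any replica that missed an update still has a term at
     * least as high as the leader's, bump the leader and every replica that
     * shares the leader's term, leaving the replicas that missed it behind.
     */
    void ensureHigherThan(String leader, Set<String> missed) {
        long leaderTerm = terms.get(leader);
        boolean needed = false;
        for (String r : missed) {
            Long t = terms.get(r);
            if (t != null && t >= leaderTerm) {
                needed = true;
            }
        }
        if (!needed) {
            return; // the missed replicas are already behind the leader
        }
        for (Map.Entry<String, Long> e : terms.entrySet()) {
            String name = e.getKey();
            if (name.endsWith(RECOVERING_SUFFIX) || missed.contains(name)) {
                continue;
            }
            if (e.getValue() == leaderTerm) {
                e.setValue(leaderTerm + 1);
            }
        }
    }

    /** Replica side: take the highest term and mark the replica as recovering. */
    void startRecovery(String replica) {
        long max = maxTerm();
        terms.put(replica, max);
        terms.put(replica + RECOVERING_SUFFIX, max); // assumed flag value
    }

    /** Replica side: clear the recovering flag once recovery is done. */
    void finishRecovery(String replica) {
        terms.remove(replica + RECOVERING_SUFFIX);
    }

    /** A replica is eligible only with the highest term and no recovering flag. */
    boolean canBecomeLeader(String replica) {
        Long t = terms.get(replica);
        return t != null
            && t == maxTerm()
            && !terms.containsKey(replica + RECOVERING_SUFFIX);
    }

    /** Returns a copy of the current term map. */
    Map<String, Long> snapshot() {
        return new HashMap<>(terms);
    }

    public static void main(String[] args) {
        ShardTermsSketch shard = new ShardTermsSketch();
        shard.register("core_node1"); // leader
        shard.register("core_node2");
        shard.register("core_node3");
        // Both replicas are DOWN while the leader accepts update A.
        Set<String> down = new HashSet<>(Arrays.asList("core_node2", "core_node3"));
        shard.ensureHigherThan("core_node1", down);
        System.out.println("core_node2 eligible: " + shard.canBecomeLeader("core_node2"));
        shard.startRecovery("core_node2");
        shard.finishRecovery("core_node2");
        System.out.println("core_node2 eligible: " + shard.canBecomeLeader("core_node2"));
    }
}
```

In this sketch a replica that missed update A stays ineligible for leadership 
until it has both caught up to the highest term and cleared its recovering 
flag, which matches the three states listed in the description.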



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

