[ https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379927#comment-16379927 ]

Shalin Shekhar Mangar edited comment on SOLR-12011 at 2/28/18 8:19 AM:
-----------------------------------------------------------------------

Thanks Dat. A few comments:
# The line log.info("skip url:{} cause its term is less than leader", 
replica.getCoreUrl()); will be logged on every update request while the other 
replicas don't have the same term as the leader. Perhaps this should be at 
debug level.
# ElectionContext has {{if (weAreReplacement && isClosed)}}. Did you mean 
{{!isClosed}}?
# ElectionContext has {{getReplicas(EnumSet.of(Replica.Type.TLOG, 
Replica.Type.TLOG)}}. Perhaps you meant TLOG and NRT? (Both of these are 
sketched after this list.)
# ElectionContext has replaced shouldIBeLeader(), which has a check for the 
last published state being active. I'm curious whether there can be a 
condition where the term is not registered and the last published state is 
not active, and the replica therefore becomes the leader.
# PrepRecoveryOp refreshes terms if 
{{shardTerms.skipSendingUpdatesTo(coreNodeName)}} returns true. But should it 
not wait in a loop for the skip status to go away, along the lines sketched 
after this list? The reason behind PrepRecovery is to ensure that when the 
call to prep recovery returns, the leader has already seen the 
{{waitForState}} state and is therefore already forwarding updates to the 
recovering replica. Now that the behavior has changed to forward updates only 
after the term is equal, rather than depending on seeing the 'recovering' 
state, we should change PrepRecovery as well.
# Add a comment before calling {{getShardTerms(collection, 
shardId).startRecovering(coreNodeName);}} and {{getShardTerms(collection, 
shardId).doneRecovering(coreNodeName);}} in ZkController.publish() describing 
why they are necessary and why only PULL replicas are excluded. I understand 
the reason, but it can be confusing to others reading this code.
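
For (2) and (3), a minimal, self-contained sketch of what I think the 
corrected lines should look like. {{Type}} here is just a stand-in for 
{{Replica.Type}}, and the surrounding method is illustrative, not the actual 
ElectionContext code:
{code:java}
import java.util.EnumSet;

// Sketch only: "Type" stands in for org.apache.solr.common.cloud.Replica.Type and the
// method below is illustrative; the real logic lives in the ElectionContext subclasses.
class ElectionContextSketch {
  enum Type { NRT, TLOG, PULL }

  void runLeaderProcessSketch(boolean weAreReplacement, boolean isClosed) {
    // (2) presumably the intent is to act only while the election context is still open
    if (weAreReplacement && !isClosed) {
      // (3) the replica types that can take part in leadership are TLOG and NRT,
      // rather than TLOG listed twice
      EnumSet<Type> leaderEligible = EnumSet.of(Type.TLOG, Type.NRT);
      // ... e.g. getReplicas(leaderEligible) ...
    }
  }
}
{code}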
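
For (5), a rough sketch of the kind of wait loop I have in mind. 
{{ShardTerms}} stands in for the patch's ZkShardTerms; 
{{skipSendingUpdatesTo}} comes from the patch, while {{refresh()}}, the 
timeout and the backoff are assumptions of mine, not the actual API:
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch only: ShardTerms stands in for the patch's ZkShardTerms; refresh() is an
// assumed "re-read terms from ZooKeeper" hook, not a real method.
interface ShardTerms {
  boolean skipSendingUpdatesTo(String coreNodeName);
  void refresh();
}

class PrepRecoverySketch {
  // Wait until the leader stops skipping updates to this replica instead of refreshing
  // terms just once, mirroring how PrepRecovery waits for the published state today.
  static void waitUntilNotSkipped(ShardTerms terms, String coreNodeName, long timeoutMs)
      throws InterruptedException, TimeoutException {
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (terms.skipSendingUpdatesTo(coreNodeName)) {
      if (System.nanoTime() > deadline) {
        throw new TimeoutException("leader is still skipping updates to " + coreNodeName);
      }
      terms.refresh();     // re-read terms before checking again
      Thread.sleep(500);   // arbitrary backoff
    }
  }
}
{code}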



> Consistence problem when in-sync replicas are DOWN
> --------------------------------------------------
>
>                 Key: SOLR-12011
>                 URL: https://issues.apache.org/jira/browse/SOLR-12011
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Cao Manh Dat
>            Assignee: Cao Manh Dat
>            Priority: Major
>         Attachments: SOLR-12011.patch
>
>
> Currently, we can run into a consistency problem when in-sync replicas are 
> DOWN. For example:
> 1. A collection has 1 shard with 1 leader and 2 replicas
> 2. The nodes containing the 2 replicas go down
> 3. The leader receives an update A successfully
> 4. The node containing the leader goes down
> 5. The 2 replicas come back
> 6. One of them becomes leader --> but it shouldn't become leader, since it 
> missed update A
> A solution to this issue:
> - The idea here is that the term value of each replica is enough to tell 
> whether that replica has received the latest updates or not
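
A minimal illustration of that idea, assuming terms are tracked per core as 
simple counters; the names below are invented for the example and do not 
match the patch:
{code:java}
import java.util.Map;

// Sketch only: a replica is eligible to become leader only if its term is at least as
// high as every other term in the shard, i.e. it has not missed any update that bumped
// the term while it was down.
class TermEligibilitySketch {
  // terms: coreNodeName -> term value, as tracked per shard (e.g. in ZooKeeper)
  static boolean canBecomeLeader(Map<String, Long> terms, String coreNodeName) {
    long highest = terms.values().stream().mapToLong(Long::longValue).max().orElse(0L);
    long mine = terms.getOrDefault(coreNodeName, 0L);
    return mine >= highest;
  }
}
{code}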


