[ 
https://issues.apache.org/jira/browse/SOLR-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16683323#comment-16683323
 ] 

Cao Manh Dat edited comment on SOLR-12969 at 11/12/18 7:17 AM:
---------------------------------------------------------------

There are actually 2 problems that I think we should fix here:
 * Since PeerSyncWithLeader is the case where a replica syncs with its leader, 
the result should not be ALREADY_IN_SYNC when the fingerprint comparison fails.
 * The PeerSync check for already_in_sync should be more robust: 
"{{ourLowThreshold > otherHigh && ourHighest >= otherHighest}}" instead of 
"{{ourLowThreshold > otherHigh}}".

I have attached a patch that fixes this problem.
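
As a minimal sketch (not the actual PeerSync code), the two conditions can be compared on the numbers from the scenario in the issue description. The class and method names are hypothetical, and the values ourHighest = 101 and otherHighest = 102 are assumptions read off the tlogs listed there.

```java
// Hypothetical illustration only; this is not taken from the Solr source.
class AlreadyInSyncCheck {

    // Condition currently used, as quoted above.
    static boolean oldCheck(long ourLowThreshold, long otherHigh) {
        return ourLowThreshold > otherHigh;
    }

    // Proposed condition, as quoted above: additionally require that our
    // newest version is at least as new as the other side's newest version.
    static boolean newCheck(long ourLowThreshold, long otherHigh,
                            long ourHighest, long otherHighest) {
        return ourLowThreshold > otherHigh && ourHighest >= otherHighest;
    }

    public static void main(String[] args) {
        long ourLowThreshold = 101; // replica's low, from the issue description
        long otherHigh = 80;        // leader's high, from the issue description
        long ourHighest = 101;      // assumed: newest entry in replica's tlog [101]
        long otherHighest = 102;    // assumed: newest entry in leader's tlog

        System.out.println("old check says in sync: "
                + oldCheck(ourLowThreshold, otherHigh));
        System.out.println("new check says in sync: "
                + newCheck(ourLowThreshold, otherHigh, ourHighest, otherHighest));
    }
}
```

With these values the old condition reports the replica as in sync although it is missing document 102, while the proposed condition does not.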

 


> Solr replication failure
> ------------------------
>
>                 Key: SOLR-12969
>                 URL: https://issues.apache.org/jira/browse/SOLR-12969
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: replication (java)
>            Reporter: Jeremy Smith
>            Assignee: Cao Manh Dat
>            Priority: Major
>         Attachments: SOLR-12969.patch, SOLR-12969.patch
>
>
> Under certain circumstances, replication fails between a leader and follower. 
>  The follower will not receive updates from the leader, even though the 
> leader has a newer version.  If the leader is restarted, it will get the 
> older version from the follower.
>  
> This was discussed on the [mailing 
> list|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201810.mbox/%3CBYAPR04MB4406710795EA07E93BF80913ADCD0%40BYAPR04MB4406.namprd04.prod.outlook.com%3E]
>  and [~risdenk] [wrote a 
> script|https://github.com/risdenk/test-solr-start-stop-replica-consistency] 
> that demonstrates this error.  He also verified that the error occurs when 
> the script is run outside of docker.
>  
> Here is the scenario of the failure:
>  * A collection with 1 shard and 2 replicas
>  * Stop the non-leader replica (B)
>  * Index more than 100 documents to the collection
>  * Start replica B; it fails to do PeerSync and starts segment replication
>  * Index the 101st document to the collection
>  ** Leader's tlog: [1, 2, 3, ..., 100, 101]
>  ** Replica's tlog: [101]
>  * Stop replica B
>  * Index the 102nd document to the collection
>  * Start replica B; on doing PeerSync:
>  ** Leader's tlog: [1, 2, 3, ..., 100, 101, 102]
>  ** Replica's tlog: [101]
>  ** Leader's high (80th): 80
>  ** Replica's low: 101
>  ** By comparison: replica's low > leader's high => ALREADY_IN_SYNC


