[ https://issues.apache.org/jira/browse/SOLR-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16683323#comment-16683323 ]
Cao Manh Dat edited comment on SOLR-12969 at 11/12/18 7:17 AM:
---------------------------------------------------------------

There are actually 2 problems that I think we should fix here:
 * Since PeerSyncWithLeader is the case where a replica syncs with its leader, the result should not be ALREADY_IN_SYNC when the fingerprint comparison fails.
 * The PeerSync check for already_in_sync should be more robust: "{{ourLowThreshold > otherHigh && ourHighest >= otherHighest}}" instead of "{{ourLowThreshold > otherHigh}}".

Attached a patch for fixing this problem.

> Solr replication failure
> ------------------------
>
>                 Key: SOLR-12969
>                 URL: https://issues.apache.org/jira/browse/SOLR-12969
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: replication (java)
>            Reporter: Jeremy Smith
>            Assignee: Cao Manh Dat
>            Priority: Major
>         Attachments: SOLR-12969.patch, SOLR-12969.patch
>
>
> Under certain circumstances, replication fails between a leader and follower.
> The follower will not receive updates from the leader, even though the
> leader has a newer version. If the leader is restarted, it will get the
> older version from the follower.
>
> This was discussed on the [mailing
> list|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201810.mbox/%3CBYAPR04MB4406710795EA07E93BF80913ADCD0%40BYAPR04MB4406.namprd04.prod.outlook.com%3E]
> and [~risdenk] [wrote a
> script|https://github.com/risdenk/test-solr-start-stop-replica-consistency]
> that demonstrates this error.
> He also verified that the error occurs when
> the script is run outside of Docker.
>
> Here is the scenario of the failure:
>  * A collection with 1 shard and 2 replicas
>  * Stop the non-leader replica (B)
>  * Index more than 100 documents to the collection
>  * Start replica B; it fails to do PeerSync and starts segment replication
>  * Index the 101st document to the collection
>  ** Leader's tlog: [1, 2, 3, ..., 100, 101]
>  ** Replica's tlog: [101]
>  * Stop replica B
>  * Index the 102nd document to the collection
>  * Start replica B; on doing PeerSync:
>  ** Leader's tlog: [1, 2, 3, ..., 100, 101, 102]
>  ** Replica's tlog: [101]
>  ** Leader's high (80th): 80
>  ** Replica's low: 101
>  ** By comparison: replica's low > leader's high => ALREADY_IN_SYNC
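
To make the failure scenario concrete, here is a minimal Java sketch that compares the current already_in_sync condition with the proposed one. It is an illustration only, not Solr's PeerSync code: the class name, the `descendingVersions` and `percentile` helpers, and the use of document numbers as versions are all assumptions made for this example. Because the percentile helper is simplified, the leader's high threshold comes out as 82 rather than the 80 quoted in the scenario; the outcome of the comparison is the same.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and helpers are hypothetical, not Solr's API.
class AlreadyInSyncSketch {

    // Builds the versions from..to (inclusive), newest first.
    static List<Long> descendingVersions(long from, long to) {
        List<Long> versions = new ArrayList<>();
        for (long v = to; v >= from; v--) {
            versions.add(v);
        }
        return versions;
    }

    // Picks the element at the given fraction of a list sorted newest first
    // (simplified assumption about how the thresholds are derived).
    static long percentile(List<Long> sortedDesc, float frac) {
        int idx = (int) (sortedDesc.size() * frac);
        return sortedDesc.get(Math.min(idx, sortedDesc.size() - 1));
    }

    // Condition quoted in the comment as the current check.
    static boolean oldCheck(long ourLowThreshold, long otherHigh) {
        return ourLowThreshold > otherHigh;
    }

    // Condition quoted in the comment as the proposed, more robust check.
    static boolean proposedCheck(long ourLowThreshold, long otherHigh,
                                 long ourHighest, long otherHighest) {
        return ourLowThreshold > otherHigh && ourHighest >= otherHighest;
    }

    public static void main(String[] args) {
        List<Long> leader = descendingVersions(1, 102);    // leader's tlog: 1..102
        List<Long> replica = descendingVersions(101, 101); // replica's tlog: [101]

        long otherHigh = percentile(leader, 0.2f);         // 82 in this sketch
        long ourLowThreshold = percentile(replica, 0.8f);  // 101
        long ourHighest = replica.get(0);                  // 101
        long otherHighest = leader.get(0);                 // 102

        // The replica is missing update 102, yet the old check reports "in sync".
        System.out.println("old check: " + oldCheck(ourLowThreshold, otherHigh)); // old check: true
        System.out.println("proposed check: "
            + proposedCheck(ourLowThreshold, otherHigh, ourHighest, otherHighest)); // proposed check: false
    }
}
```

The extra `ourHighest >= otherHighest` term is what catches this case: the replica's only update is newer than most of the leader's, but its newest update (101) is still older than the leader's newest (102).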