[jira] [Commented] (SOLR-12969) Solr replication failure
[ https://issues.apache.org/jira/browse/SOLR-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683475#comment-16683475 ]

Cao Manh Dat commented on SOLR-12969:
-------------------------------------

Actually, I have another idea for making PeerSync independent of the tlog. It requires two changes:
* Versions should increase monotonically (instead of relying on clock time).
* Each commit point will contain a compressed representation of all the update versions that have been applied, e.g. 1..100,103..1000.
* Based on these changes, we can quickly work out which versions we missed, without computing a fingerprint or relying on the tlog.
* We still need to think about FORCELEADER and similar cases, i.e. different updates that have the same update version.

However, this approach requires a lot of changes, and I'm not sure whether it is worth it. What we gain by going with it:
* A faster and more reliable PeerSync process.
* We no longer have to compute a fingerprint.

Maybe one or two more failure cases would convince me to switch to this approach. Any thoughts? [~shalinmangar] [~markrmil...@gmail.com] [~ysee...@gmail.com]

> Solr replication failure
> ------------------------
>
> Key: SOLR-12969
> URL: https://issues.apache.org/jira/browse/SOLR-12969
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Components: replication (java)
> Reporter: Jeremy Smith
> Assignee: Cao Manh Dat
> Priority: Major
> Attachments: SOLR-12969.patch, SOLR-12969.patch, SOLR-12969.patch
>
> Under certain circumstances, replication fails between a leader and follower.
> The follower will not receive updates from the leader, even though the
> leader has a newer version. If the leader is restarted, it will get the
> older version from the follower.
>
> This was discussed on the [mailing list|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201810.mbox/%3CBYAPR04MB4406710795EA07E93BF80913ADCD0%40BYAPR04MB4406.namprd04.prod.outlook.com%3E]
> and [~risdenk] [wrote a script|https://github.com/risdenk/test-solr-start-stop-replica-consistency]
> that demonstrates this error. He also verified that the error occurs when
> the script is run outside of docker.
>
> Here is the scenario of the failure:
> * A collection with 1 shard and 2 replicas
> * Stop the non-leader replica (B)
> * Index more than 100 documents to the collection
> * Start replica B; it fails to do PeerSync and starts segment replication
> * Index the 101st document to the collection
> ** Leader's tlog: [1, 2, 3, ..., 100, 101]
> ** Replica's tlog: [101]
> * Stop replica B
> * Index the 102nd document to the collection
> * Start replica B; on doing PeerSync:
> ** Leader's tlog: [1, 2, 3, ..., 100, 101, 102]
> ** Replica's tlog: [101]
> ** Leader's high (80th): 80
> ** Replica's low: 101
> ** By comparison: replica's low > leader's high => ALREADY_IN_SYNC

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
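
The compressed-version idea in the comment above (ranges such as 1..100,103..1000 stored with each commit point) can be illustrated with a small sketch. This is not Solr code: it is written in Python for brevity, and the names `compress` and `missing` as well as the list-of-tuples representation are assumptions made purely for illustration. It only shows how two range sets could be compared to find the versions a replica lacks, under the comment's assumption that versions increase monotonically.

```python
from typing import Iterable, List, Tuple

Range = Tuple[int, int]  # hypothetical representation: inclusive (low, high)


def compress(versions: Iterable[int]) -> List[Range]:
    """Collapse version numbers into sorted, inclusive, non-adjacent ranges."""
    ranges: List[Range] = []
    for v in sorted(set(versions)):
        if ranges and v == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], v)  # extend the current run
        else:
            ranges.append((v, v))  # start a new run
    return ranges


def missing(leader: List[Range], replica: List[Range]) -> List[Range]:
    """Ranges present on the leader but absent on the replica.

    Both inputs must be sorted and non-overlapping (as produced by compress).
    """
    result: List[Range] = []
    i = 0
    for lo, hi in leader:
        cur = lo  # next leader version not yet accounted for
        # Skip replica ranges that end before this leader range starts.
        while i < len(replica) and replica[i][1] < cur:
            i += 1
        j = i
        # Walk the replica ranges that overlap this leader range.
        while j < len(replica) and replica[j][0] <= hi:
            r_lo, r_hi = replica[j]
            if r_lo > cur:
                result.append((cur, r_lo - 1))  # gap before this replica range
            cur = max(cur, r_hi + 1)
            j += 1
        if cur <= hi:
            result.append((cur, hi))  # tail of the leader range is missing
    return result


if __name__ == "__main__":
    leader = compress(list(range(1, 101)) + list(range(103, 1001)))
    replica = compress(list(range(1, 81)) + list(range(103, 1001)))
    print(leader)                    # [(1, 100), (103, 1000)]
    print(missing(leader, replica))  # [(81, 100)]
```

Because both lists are sorted, the comparison is a single pass over the ranges rather than over individual updates. The sketch does not address the FORCELEADER concern raised in the comment, where different updates could share the same version.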
[jira] [Commented] (SOLR-12969) Solr replication failure
[ https://issues.apache.org/jira/browse/SOLR-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683323#comment-16683323 ]

Cao Manh Dat commented on SOLR-12969:
-------------------------------------

There are actually two problems that I think we should fix here:
* Since PeerSyncWithLeader handles the case where a replica syncs with its leader, the result should not be ALREADY_IN_SYNC when the fingerprint comparison has failed.
* The PeerSync check for already_in_sync should be more robust: "{{ourLowThreshold > otherHigh && ourHighest >= otherHighest}}" instead of "{{ourLowThreshold > otherHigh}}".

> Solr replication failure
> ------------------------
>
> Key: SOLR-12969
> URL: https://issues.apache.org/jira/browse/SOLR-12969
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Components: replication (java)
> Reporter: Jeremy Smith
> Assignee: Cao Manh Dat
> Priority: Major
> Attachments: SOLR-12969.patch, SOLR-12969.patch
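
To make the second point of the comment above concrete, here is a minimal sketch of the two already-in-sync conditions, evaluated with the numbers from the failure scenario in the issue description (replica's low 101, leader's high 80, replica's tlog [101], leader's tlog [1, ..., 102]). It is written in Python for brevity and is not Solr's PeerSync code: the function names are hypothetical, and the thresholds are passed in directly rather than computed the way Solr computes them.

```python
def already_in_sync_current(our_low_threshold: int, other_high: int) -> bool:
    """Check quoted in the comment as the existing condition."""
    return our_low_threshold > other_high


def already_in_sync_proposed(
    our_low_threshold: int,
    other_high: int,
    our_highest: int,
    other_highest: int,
) -> bool:
    """Check quoted in the comment as the more robust condition.

    In addition to the threshold test, we must have seen a version at least
    as new as the newest one the other side has.
    """
    return our_low_threshold > other_high and our_highest >= other_highest


if __name__ == "__main__":
    # Values taken from the scenario: the replica is "ours", the leader is "other".
    replica_low, leader_high = 101, 80
    replica_highest, leader_highest = 101, 102

    # The existing check wrongly reports that the replica is in sync.
    print(already_in_sync_current(replica_low, leader_high))  # True

    # The proposed check notices that the leader holds a newer version (102).
    print(already_in_sync_proposed(
        replica_low, leader_high, replica_highest, leader_highest))  # False
```

With the extra comparison, the replica in the scenario is no longer treated as already in sync, because its highest version (101) is older than the leader's (102).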
[jira] [Commented] (SOLR-12969) Solr replication failure
[ https://issues.apache.org/jira/browse/SOLR-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679971#comment-16679971 ]

Cao Manh Dat commented on SOLR-12969:
-------------------------------------

I confirm that this problem happens on different versions of Solr (maybe from the start, since this is a problem in the PeerSync logic). I have attached a patch that fixes the problem, but I really don't like the way we currently do PeerSync. I may think about this for a couple of days before committing the patch.

P.S.: The scenario in which this failure happens is tricky, and this bug has been present since SolrCloud first appeared, so we don't have to rush.

> Solr replication failure
> ------------------------
>
> Key: SOLR-12969
> URL: https://issues.apache.org/jira/browse/SOLR-12969
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Components: replication (java)
> Reporter: Jeremy Smith
> Assignee: Cao Manh Dat
> Priority: Major
> Attachments: SOLR-12969.patch
[jira] [Commented] (SOLR-12969) Solr replication failure
[ https://issues.apache.org/jira/browse/SOLR-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677277#comment-16677277 ]

Kevin Risden commented on SOLR-12969:
-------------------------------------

[~caomanhdat] - Do you have any ideas here? I think you did work on replication.

> Solr replication failure
> ------------------------
>
> Key: SOLR-12969
> URL: https://issues.apache.org/jira/browse/SOLR-12969
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Components: replication (java)
> Reporter: Jeremy Smith
> Priority: Major