[jira] [Commented] (SOLR-11216) Make PeerSync more robust
[ https://issues.apache.org/jira/browse/SOLR-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516747#comment-16516747 ]

Cao Manh Dat commented on SOLR-11216:
-------------------------------------

Attached a patch including the optimization mentioned above, plus more javadocs and TODO notes. I will beast the test with PeerSyncReplicationTest to make sure it works correctly. If the results are OK and there are no objections, I will commit the patch soon.

> Make PeerSync more robust
> -------------------------
>
> Key: SOLR-11216
> URL: https://issues.apache.org/jira/browse/SOLR-11216
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Cao Manh Dat
> Priority: Major
> Attachments: SOLR-11216.patch, SOLR-11216.patch, SOLR-11216.patch, SOLR-11216.patch, SOLR-11216.patch, SOLR-11216.patch
>
> First of all, I will change the issue's title to a better name when I have one.
> When digging into SOLR-10126, I found a case that can make peerSync fail.
> * leader and replica receive updates 1 to 4
> * replica stops
> * replica misses updates 5, 6
> * replica starts recovery
> ## replica buffers updates 7, 8
> ## replica requests versions from the leader
> ## at the same time the leader receives update 9, so it returns versions 1 to 9 for the versions request (the replica's view of recent versions will be 1,2,3,4,5,6,7,8,9)
> ## replica does peersync and requests updates 5, 6, 9 from the leader
> ## replica applies updates 5, 6, 9. Its index does not have updates 7, 8 and maxVersionSpecified for the fingerprint is 9, therefore the fingerprint comparison will fail
> My idea here is: why does the replica request update 9 (step 6) while it knows that updates with lower versions (updates 7, 8) are in its buffering tlog? Should we request only updates that are lower than the lowest update in its buffering tlog (< 7)?
> Someone may ask what happens if the replica never receives update 9. In that case, the leader will put the replica into LIR state, so the replica will run the recovery process again.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
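The failure case in the issue description above can be replayed on toy data. The sketch below is not Solr code: the class and method names are hypothetical, versions are plain positive longs, and the "fingerprint" is simply the set of indexed versions up to a maximum, standing in for the real IndexFingerprint hash.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names exist in Solr.
final class PeerSyncFailureScenario {

    /** Versions the leader reported that the replica has neither indexed nor buffered. */
    static SortedSet<Long> naiveMissed(SortedSet<Long> leaderVersions,
                                       SortedSet<Long> indexed,
                                       SortedSet<Long> buffered) {
        SortedSet<Long> missed = new TreeSet<>(leaderVersions);
        missed.removeAll(indexed);
        missed.removeAll(buffered);
        return missed;
    }

    /** Stand-in fingerprint (an assumption): the indexed versions that are <= maxVersion. */
    static SortedSet<Long> fingerprint(SortedSet<Long> indexed, long maxVersion) {
        return new TreeSet<>(indexed.headSet(maxVersion + 1));
    }

    public static void main(String[] args) {
        SortedSet<Long> leader = new TreeSet<>(Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L));
        SortedSet<Long> indexed = new TreeSet<>(Arrays.asList(1L, 2L, 3L, 4L));
        SortedSet<Long> buffered = new TreeSet<>(Arrays.asList(7L, 8L));

        SortedSet<Long> missed = naiveMissed(leader, indexed, buffered);
        // The replica applies what it fetched; 7 and 8 are still only in the buffer.
        indexed.addAll(missed);

        System.out.println("requested: " + missed);
        System.out.println("fingerprints match: "
            + fingerprint(indexed, 9L).equals(fingerprint(leader, 9L)));
    }
}
```

With the numbers from the description, the replica requests 5, 6 and 9, and its index then lacks 7 and 8 when the comparison up to version 9 is made.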
[jira] [Commented] (SOLR-11216) Make PeerSync more robust
[ https://issues.apache.org/jira/browse/SOLR-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16515335#comment-16515335 ]

Cao Manh Dat commented on SOLR-11216:
-------------------------------------

Attached a patch with refactoring. The code seems much cleaner now.
[jira] [Commented] (SOLR-11216) Make PeerSync more robust
[ https://issues.apache.org/jira/browse/SOLR-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16514525#comment-16514525 ]

Cao Manh Dat commented on SOLR-11216:
-------------------------------------

Thanks, guys, for your reviews. This is a rough patch; things still need to be changed or moved around to make it cleaner. To be clearer, the process of the new PeerSync (PeerSyncWithLeader) is:
* Replica gets its recent update versions
* Replica requests recent update versions + fingerprint from the leader
* Replica requests missed updates (updates in the buffer tlog are considered missed updates) up to the leader's {{fingerprint.maxVersionEncountered}}
* Replica applies the missed updates, then compares its fingerprint with the leader's fingerprint from step 2

The reason for getting the fingerprint in step 2 is that we do not trust {{fingerprint.maxVersionSpecified}}. Therefore we must use the fingerprint of the leader with {{fingerprint.maxVersionSpecified==Long.MAX_VALUE}} (or the fingerprint of the leader's index at the time of step 2). We may need to block updates between getting recent versions and computing the fingerprint on the leader's side, but let's do that later.
By requesting updates up to {{fingerprint.maxVersionEncountered}}, we make sure that after applying the updates, {{replica.maxVersionEncountered}} will equal the leader's, hence its fingerprint will be the same as the leader's.
Another optimization here is in step 3: instead of considering buffered updates as missed updates, we just need to remember which buffered updates need to be applied in step 4.
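The four PeerSyncWithLeader steps listed in the comment above can be walked through on toy data. This is only a sketch under stated assumptions: the class is hypothetical, the leader's index is modelled as a sorted set of positive versions (assumed non-empty), the fingerprint is a copy of that set, and one update is assumed to land on the leader between computing the fingerprint and collecting the version list.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only; this is not the PeerSyncWithLeader class from the patch.
final class PeerSyncWithLeaderSketch {

    /** Runs steps 1-4 and returns true when the replica ends up matching the leader's fingerprint. */
    static boolean sync(SortedSet<Long> replicaIndex, SortedSet<Long> leaderIndex, long lateVersion) {
        // Step 1: the replica's own recent versions (buffered updates are not counted).
        SortedSet<Long> ours = new TreeSet<>(replicaIndex);

        // Step 2: the leader computes its fingerprint (stand-in: a copy of its versions).
        SortedSet<Long> leaderFingerprint = new TreeSet<>(leaderIndex);
        long maxVersionEncountered = leaderIndex.last();

        // Assumed race: another update arrives before the version list is collected.
        leaderIndex.add(lateVersion);
        SortedSet<Long> leaderVersions = new TreeSet<>(leaderIndex);

        // Step 3: request everything we lack, capped at maxVersionEncountered.
        SortedSet<Long> missed = new TreeSet<>(leaderVersions.headSet(maxVersionEncountered + 1));
        missed.removeAll(ours);

        // Step 4: apply the missed updates and compare with the fingerprint from step 2.
        replicaIndex.addAll(missed);
        return replicaIndex.equals(leaderFingerprint);
    }
}
```

In this model the cap keeps the late update out of the replica's index, so the comparison is made against the same set of versions the leader fingerprinted. The opposite ordering (version list first, fingerprint second) is the case the comment says may require blocking updates on the leader, and it is not modelled here.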
[jira] [Commented] (SOLR-11216) Make PeerSync more robust
[ https://issues.apache.org/jira/browse/SOLR-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16514244#comment-16514244 ]

Yonik Seeley commented on SOLR-11216:
-------------------------------------

{quote}
SolrQueryRequest req = new LocalSolrQueryRequest(core, new ModifiableSolrParams());
request is not safely closed, is this intentional? won't this break the reference count mechanism?
{quote}
Yeah, it does look like it should be closed. A SolrQueryRequest grabs a searcher reference on-demand, so that may be why it isn't causing an issue with any tests (the commit command doesn't grab a searcher reference with the provided request). It should be fixed anyway though.
[jira] [Commented] (SOLR-11216) Make PeerSync more robust
[ https://issues.apache.org/jira/browse/SOLR-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16514102#comment-16514102 ]

hamada commented on SOLR-11216:
-------------------------------

General review comments, some not related to the patch but relevant in general.

PeerSyncWithLeader
* use startingVersions.isEmpty() rather than size() == 0, same for 215
* The following try/finally can return, in which case proc is not closed. Is this intentional? If so, please add a comment to that effect.
* line 299: consider sizing the List properly to avoid the garbage side effect of growing the list; the same applies to line 317

HttpShardHandler.java
* replace if (urls.size()==0) { with if (urls.isEmpty()) {

RecoveryStrategy.java
* line 223 and 613, 235 (on core.getDeletionPolicy().getLatestCommit().getGeneration()) may result in an NPE
* line 436: SolrQueryRequest req = new LocalSolrQueryRequest(core, new ModifiableSolrParams()); the request is not safely closed. Is this intentional? Won't this break the reference count mechanism?
[jira] [Commented] (SOLR-11216) Make PeerSync more robust
[ https://issues.apache.org/jira/browse/SOLR-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513887#comment-16513887 ]

Cao Manh Dat commented on SOLR-11216:
-------------------------------------

Attached a patch for Solution 2. It creates a new class, PeerSyncWithLeader, which duplicates some code from the original class (PeerSync), but what we gain is an easier-to-understand flow (fewer flags) that is optimized for doing peerSync on recovery. Any objections to this separation? [~shalinmangar] [~markrmil...@gmail.com]
[jira] [Commented] (SOLR-11216) Make PeerSync more robust
[ https://issues.apache.org/jira/browse/SOLR-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511005#comment-16511005 ]

Cao Manh Dat commented on SOLR-11216:
-------------------------------------

Attached a patch for anyone who wants to reproduce the problems of Solution 3.
[jira] [Commented] (SOLR-11216) Make PeerSync more robust
[ https://issues.apache.org/jira/browse/SOLR-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511004#comment-16511004 ]

Cao Manh Dat commented on SOLR-11216:
-------------------------------------

After spending a day adding more tests and debugging the problem, I think that with the current IndexFingerprint implementation we can't go with Solution 3.
First, to go with Solution 3 we must compute the fingerprint of the index up to a specified point. But just by looking at the current index, we can't do that. For example:
A leader:
- with updates: doc1(v=0), doc2(v=1), doc3(v=3), delete(doc3, v=4), doc2(v=5)
- its index will be: doc1(v=0), doc2(v=5)
A replica:
- with index: doc1(v=0), doc2(v=1)

Case 1: The replica asks for updates and the fingerprint up to (and including) v=3. The leader will return update doc3(v=3).
- leader's fingerprint will be the hash of doc1(v=0) (it will skip doc2, since its version 5 > the specified version 3)
- replica's fingerprint will be the hash of doc1(v=0), doc2(v=1), doc3(v=3)
-> incorrect fingerprint.
There are many other cases which are very tricky to solve. Therefore I think the best thing to do now is Solution 2.
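The leader/replica example in the comment above can be reproduced with a simplified fingerprint. The sketch assumes a fingerprint made of a count plus a sum of per-version hashes over the live documents whose version is at or below the bound; the class name and the mixing function are stand-ins chosen for illustration, not the actual IndexFingerprint code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the real IndexFingerprint implementation.
final class BoundedFingerprintSketch {

    /** Stand-in 64-bit mixer; any reasonable hash of the version would do here. */
    static long mix(long v) {
        v ^= v >>> 33;
        v *= 0xff51afd7ed558ccdL;
        v ^= v >>> 33;
        v *= 0xc4ceb9fe1a85ec53L;
        v ^= v >>> 33;
        return v;
    }

    /**
     * Fingerprint over the live documents (docId -> current version) whose version
     * is <= maxVersion. Returns [numVersions, versionsHash].
     */
    static List<Long> fingerprint(Map<String, Long> liveDocs, long maxVersion) {
        long numVersions = 0;
        long versionsHash = 0;
        for (long version : liveDocs.values()) {
            if (version <= maxVersion) {
                numVersions++;
                versionsHash += mix(version);
            }
        }
        return Arrays.asList(numVersions, versionsHash);
    }

    public static void main(String[] args) {
        // Leader after doc1(v=0), doc2(v=1), doc3(v=3), delete(doc3, v=4), doc2(v=5).
        Map<String, Long> leader = new HashMap<>();
        leader.put("doc1", 0L);
        leader.put("doc2", 5L);

        // Replica after fetching and applying doc3(v=3).
        Map<String, Long> replica = new HashMap<>();
        replica.put("doc1", 0L);
        replica.put("doc2", 1L);
        replica.put("doc3", 3L);

        System.out.println("match up to v=3: "
            + fingerprint(leader, 3L).equals(fingerprint(replica, 3L)));
    }
}
```

The leader's index only keeps the latest version of each document, so bounding the fingerprint at v=3 counts one document on the leader and three on the replica, even though the replica holds exactly the state the leader had at v=3.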
[jira] [Commented] (SOLR-11216) Make PeerSync more robust
[ https://issues.apache.org/jira/browse/SOLR-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509971#comment-16509971 ]

hamada commented on SOLR-11216:
-------------------------------

From the 20,000-foot level: any time-based solution is just brittle, Solution 2 sounds like a workaround, and Solution 3 seems to fit the bill.
[jira] [Commented] (SOLR-11216) Make PeerSync more robust
[ https://issues.apache.org/jira/browse/SOLR-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509691#comment-16509691 ]

Cao Manh Dat commented on SOLR-11216:
-------------------------------------

Attached a patch for this ticket, based on Solution 3. It needs more tests, but the overall result seems good.
[jira] [Commented] (SOLR-11216) Make PeerSync more robust
[ https://issues.apache.org/jira/browse/SOLR-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16507649#comment-16507649 ]

Cao Manh Dat commented on SOLR-11216:
-------------------------------------

The problem here relates to "on-wire" updates, which are returned by the "getVersions" request but are not present in the replica's buffering tlog. Here are a couple of solutions:
* Solution 1: After submitting the "getVersions" request, the replica waits for some time so that the "on-wire" updates land in the buffering tlog. This is the simplest solution but less robust than Solution 2.
* Solution 2: When finding missed updates, the replica considers buffered updates as missed ones. Hence it will request these updates from the leader and apply them to its local index -> the fingerprint comparison will succeed.
* Solution 3: When finding missed updates, the replica considers any update with a version larger than minVersion(buffered updates) as non-missed (the "on-wire" updates will be filled in by the applyBufferedUpdates() call). We only do the fingerprint comparison up to minVersion(buffered updates).

[~praste] Yeah, that case will be very tricky to solve, but at least we should solve some common cases.
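Solutions 2 and 3 above differ mainly in which versions the replica asks the leader for. The sketch below shows the two selections side by side, assuming plain positive version numbers and hypothetical class and method names. How Solution 3 should behave when nothing is buffered is not stated in the thread, so the fallback used here is an assumption.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only; these helpers do not exist in Solr.
final class MissedUpdateSelection {

    /** Solution 2: buffered updates count as missed, so only indexed versions are subtracted. */
    static SortedSet<Long> solution2(SortedSet<Long> leaderVersions, SortedSet<Long> indexed) {
        SortedSet<Long> missed = new TreeSet<>(leaderVersions);
        missed.removeAll(indexed);
        return missed;
    }

    /** Solution 3: only versions below the lowest buffered version are requested. */
    static SortedSet<Long> solution3(SortedSet<Long> leaderVersions,
                                     SortedSet<Long> indexed,
                                     SortedSet<Long> buffered) {
        SortedSet<Long> missed = new TreeSet<>(leaderVersions);
        missed.removeAll(indexed);
        if (buffered.isEmpty()) {
            return missed; // assumption: with no buffered updates there is no upper bound
        }
        return new TreeSet<>(missed.headSet(buffered.first()));
    }
}
```

For the scenario in the issue (leader versions 1 to 9, replica index 1 to 4, buffer 7 and 8), `solution2` selects 5 through 9 and `solution3` selects 5 and 6.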
[jira] [Commented] (SOLR-11216) Make PeerSync more robust
[ https://issues.apache.org/jira/browse/SOLR-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260117#comment-16260117 ]

Pushkar Raste commented on SOLR-11216:
--------------------------------------

[~caomanhdat] Can this fail if the leader processes updates out of order? For example, what if the leader has processed update 6 and has yet to process 5. Now the replica requests update 6. However, the leader has just finished processing 5 (including a soft/hard commit), and when the leader calculates the index fingerprint up to 6, the leader's fingerprint will include version 5 as well.
Considering all the race conditions, I think making the fingerprint robust is tricky.

> Make PeerSync more robust
> -------------------------
>
> Key: SOLR-11216
> URL: https://issues.apache.org/jira/browse/SOLR-11216
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Cao Manh Dat
>
> First of all, I will change the issue's title to a better name when I have one.
> When digging into SOLR-10126, I found a case that can make peerSync fail.
> * leader and replica receive updates 1 to 4
> * replica stops
> * replica misses updates 5, 6
> * replica starts recovery
> ## replica buffers updates 7, 8
> ## replica requests versions from the leader
> ## replica gets its recent versions, which are 1,2,3,4,7,8
> ## at the same time the leader receives update 9, so it returns versions 1 to 9 for the versions request
> ## replica does peersync and requests updates 5, 6, 9 from the leader
> ## replica applies updates 5, 6, 9. Its index does not have updates 7, 8 and maxVersionSpecified for the fingerprint is 9, therefore the fingerprint comparison will fail
> My idea here is: why does the replica request update 9 (step 6) while it knows that updates with lower versions (updates 7, 8) are in its buffering tlog? Should we request only updates that are lower than the lowest update in its buffering tlog (< 7)?
> Someone may ask what happens if the replica never receives update 9. In that case, the leader will put the replica into LIR state, so the replica will run the recovery process again.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org