[
https://issues.apache.org/jira/browse/SOLR-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16514525#comment-16514525
]
Cao Manh Dat commented on SOLR-11216:
-------------------------------------
Thanks, guys, for your reviews. This is a rough patch; things still need to be
changed and moved around to make it cleaner. To be clearer, the process of the
new PeerSync (PeerSyncWithLeader) is:
* Replica gets its recent update versions
* Replica requests recent update versions + fingerprint from the leader
* Replica requests missed updates (updates in the buffer tlog are considered
missed updates) up to the leader's {{fingerprint.maxVersionEncountered}}
* Replica applies the missed updates, then compares its fingerprint with the
leader's fingerprint from step 2
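The four steps above can be sketched with a toy model. This is only an illustration, not the code in the patch: {{PeerSyncWithLeaderSketch}}, its {{Fingerprint}} class and the {{sync}} method are hypothetical names, versions are plain positive longs, and an "index" is just a sorted set of versions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.SortedSet;
import java.util.TreeSet;

class PeerSyncWithLeaderSketch {

    /** Toy stand-in for an index fingerprint (hypothetical, not Solr's own class). */
    static final class Fingerprint {
        final long maxVersionEncountered;
        final long versionsHash;

        Fingerprint(long maxVersionEncountered, long versionsHash) {
            this.maxVersionEncountered = maxVersionEncountered;
            this.versionsHash = versionsHash;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Fingerprint)) return false;
            Fingerprint f = (Fingerprint) o;
            return maxVersionEncountered == f.maxVersionEncountered
                && versionsHash == f.versionsHash;
        }

        @Override
        public int hashCode() {
            return Objects.hash(maxVersionEncountered, versionsHash);
        }
    }

    /** Computes the toy fingerprint: highest version plus an order-independent hash. */
    static Fingerprint fingerprint(SortedSet<Long> index) {
        long max = 0L;
        long hash = 0L;
        for (long v : index) {
            max = Math.max(max, v);
            hash += 31L * v + 17L;
        }
        return new Fingerprint(max, hash);
    }

    /** Runs the four steps against in-memory sets; returns true if fingerprints match. */
    static boolean sync(SortedSet<Long> replicaIndex, SortedSet<Long> leaderIndex) {
        // Step 1: the replica's own recent versions. Buffered updates are not in the
        // index, so they fall out as "missed" below.
        SortedSet<Long> ourVersions = new TreeSet<>(replicaIndex);

        // Step 2: the leader's recent versions and its fingerprint, taken together.
        List<Long> leaderVersions = new ArrayList<>(leaderIndex);
        Fingerprint leaderFp = fingerprint(leaderIndex);

        // Step 3: missed updates, capped at the leader's maxVersionEncountered.
        List<Long> missed = new ArrayList<>();
        for (long v : leaderVersions) {
            if (!ourVersions.contains(v) && v <= leaderFp.maxVersionEncountered) {
                missed.add(v);
            }
        }

        // Step 4: apply the missed updates, then compare with the step 2 fingerprint.
        replicaIndex.addAll(missed);
        return fingerprint(replicaIndex).equals(leaderFp);
    }

    public static void main(String[] args) {
        SortedSet<Long> replica = new TreeSet<>(Arrays.asList(1L, 2L, 3L, 4L));
        SortedSet<Long> leader =
            new TreeSet<>(Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L));
        System.out.println("in sync: " + sync(replica, leader));
    }
}
```

In this model the replica ends up with exactly the leader's versions, so the comparison in step 4 succeeds.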
The reason for getting the fingerprint in step 2 is that we do not trust
{{fingerprint.maxVersionSpecified}}. Therefore we must use the leader's
fingerprint computed with {{fingerprint.maxVersionSpecified==Long.MAX_VALUE}}
(in other words, the fingerprint of the leader's index at the time of step 2).
We may need to block updates between getting recent versions and computing the
fingerprint on the leader's side, but let's do that later.
By requesting updates up to {{fingerprint.maxVersionEncountered}}, we make
sure that after applying the updates, {{replica.maxVersionEncountered}} equals
the leader's, hence its fingerprint will be the same as the leader's.
Another optimization here is in step 3: instead of considering buffered updates
as missed updates, we just need to remember which buffered updates need to be
applied in step 4.
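A minimal sketch of that optimization, under the same toy assumptions (versions are plain positive longs; {{BufferedUpdatesSketch}}, {{Plan}} and {{plan}} are hypothetical names, not part of the patch): versions the replica already holds in its buffer are set aside for step 4 instead of being fetched from the leader again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

class BufferedUpdatesSketch {

    /** Result of step 3: what to fetch from the leader and what to replay locally. */
    static final class Plan {
        final List<Long> fetchFromLeader = new ArrayList<>();
        final List<Long> applyFromBuffer = new ArrayList<>();
    }

    /** Splits the leader's versions into "fetch" and "already buffered" lists. */
    static Plan plan(SortedSet<Long> leaderVersions, Set<Long> indexed,
                     Set<Long> buffered, long maxVersionEncountered) {
        Plan p = new Plan();
        for (long v : leaderVersions) {
            if (v > maxVersionEncountered || indexed.contains(v)) {
                continue; // beyond the leader's fingerprint, or already applied
            }
            if (buffered.contains(v)) {
                p.applyFromBuffer.add(v); // remembered for step 4, not re-requested
            } else {
                p.fetchFromLeader.add(v); // genuinely missed
            }
        }
        return p;
    }

    public static void main(String[] args) {
        SortedSet<Long> leader =
            new TreeSet<>(Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L));
        Set<Long> indexed = new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L));
        Set<Long> buffered = new HashSet<>(Arrays.asList(7L, 8L));
        Plan p = plan(leader, indexed, buffered, 9L);
        System.out.println("fetch " + p.fetchFromLeader + ", replay " + p.applyFromBuffer);
    }
}
```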
> Make PeerSync more robust
> -------------------------
>
> Key: SOLR-11216
> URL: https://issues.apache.org/jira/browse/SOLR-11216
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Cao Manh Dat
> Priority: Major
> Attachments: SOLR-11216.patch, SOLR-11216.patch, SOLR-11216.patch
>
>
> First of all, I will change the issue's title to a better one when I have one.
> While digging into SOLR-10126, I found a case that can make peerSync fail.
> * leader and replica receive updates 1 to 4
> * replica stops
> * replica misses updates 5, 6
> * replica starts recovery
> ## replica buffers updates 7, 8
> ## replica requests versions from the leader,
> ## at the same time the leader receives update 9, so it will return updates 1
> to 9 (for the versions request) when the replica gets recent versions (so they
> will be 1,2,3,4,5,6,7,8,9)
> ## replica does peersync and requests updates 5, 6, 9 from the leader
> ## replica applies updates 5, 6, 9. Its index does not have updates 7, 8 and
> maxVersionSpecified for the fingerprint is 9, therefore the fingerprint
> comparison will fail
> My idea here is: why does the replica request update 9 (step 6) while it knows
> that updates with lower versions (updates 7, 8) are in its buffering tlog?
> Should we request only updates that are lower than the lowest update in its
> buffering tlog ( < 7 )?
> Someone may ask: what if the replica never receives update 9? In that case,
> the leader will put the replica into LIR state, so the replica will run the
> recovery process again.
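The idea in the issue description of requesting only versions below the lowest buffered one can be sketched as follows. This is a hedged illustration, not code from the patch: {{BufferedCutoffSketch}} and {{versionsToRequest}} are hypothetical names, and versions are treated as plain positive longs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

class BufferedCutoffSketch {

    /** Returns the leader versions the replica lacks that lie below its lowest buffered version. */
    static List<Long> versionsToRequest(SortedSet<Long> leaderVersions,
                                        Set<Long> indexed,
                                        SortedSet<Long> buffered) {
        // With an empty buffer there is no cutoff at all.
        long cutoff = buffered.isEmpty() ? Long.MAX_VALUE : buffered.first();
        List<Long> toRequest = new ArrayList<>();
        for (long v : leaderVersions) {
            if (v < cutoff && !indexed.contains(v)) {
                toRequest.add(v);
            }
        }
        return toRequest;
    }

    public static void main(String[] args) {
        // The scenario from the description: indexed 1..4, buffered 7 and 8, leader at 1..9.
        SortedSet<Long> leader =
            new TreeSet<>(Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L));
        Set<Long> indexed = new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L));
        SortedSet<Long> buffered = new TreeSet<>(Arrays.asList(7L, 8L));
        System.out.println(versionsToRequest(leader, indexed, buffered));
    }
}
```

With the scenario from the description, update 9 is no longer requested, so the replica never applies it ahead of the buffered updates 7 and 8.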
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)