[ 
https://issues.apache.org/jira/browse/SOLR-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16514525#comment-16514525
 ] 

Cao Manh Dat commented on SOLR-11216:
-------------------------------------

Thanks, guys, for your reviews. This is a rough patch; things still need to be
changed/moved around to make it cleaner. To be clearer, the process of the new
PeerSync (PeerSyncWithLeader) is:
# Replica gets the versions of its recent updates.
# Replica requests recent update versions + fingerprint from the leader.
# Replica requests missed updates (updates in the buffer tlog are considered
missed updates) up to the leader's {{fingerprint.maxVersionEncountered}}.
# Replica applies the missed updates, then compares its fingerprint with the
leader's fingerprint from step 2.
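
To make the four steps concrete, here is a minimal in-memory sketch. It is not
the patch's code: `PeerSyncWithLeaderSketch`, `Fingerprint` and `sync` are
hypothetical names, versions are modelled as plain longs (higher = newer), and
the fingerprint is reduced to (max version, count, sum of versions) instead of
Solr's real IndexFingerprint. The replica's buffered updates are deliberately
left out of `replicaIndex`, so they are fetched as missed updates, as in step 3.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/**
 * Hypothetical in-memory model of the four PeerSyncWithLeader steps.
 * Versions are positive longs (higher = newer); Fingerprint is a simplified
 * stand-in for Solr's IndexFingerprint, not its real API.
 */
public class PeerSyncWithLeaderSketch {

    /** Simplified fingerprint over ALL versions (like maxVersionSpecified == Long.MAX_VALUE). */
    public record Fingerprint(long maxVersionEncountered, int numVersions, long versionsSum) {
        public static Fingerprint of(Collection<Long> versions) {
            long max = Long.MIN_VALUE;
            long sum = 0;
            for (long v : versions) {
                max = Math.max(max, v);
                sum += v;
            }
            return new Fingerprint(max, versions.size(), sum);
        }
    }

    /**
     * Runs steps 1-4 against the replica's index (mutated in place) and returns
     * whether the replica's fingerprint matches the leader's afterwards.
     * Buffered updates are NOT in replicaIndex, so they count as missed (step 3).
     */
    public static boolean sync(SortedSet<Long> replicaIndex, SortedSet<Long> leaderIndex) {
        // Step 1: replica gets the versions of its recent updates.
        SortedSet<Long> ourVersions = new TreeSet<>(replicaIndex);

        // Step 2: replica requests recent versions + fingerprint from the leader.
        SortedSet<Long> leaderVersions = new TreeSet<>(leaderIndex);
        Fingerprint leaderFp = Fingerprint.of(leaderIndex);

        // Step 3: request missed updates up to the leader's maxVersionEncountered.
        List<Long> missed = new ArrayList<>();
        for (long v : leaderVersions) {
            if (v <= leaderFp.maxVersionEncountered() && !ourVersions.contains(v)) {
                missed.add(v);
            }
        }

        // Step 4: apply the missed updates, then compare fingerprints.
        replicaIndex.addAll(missed);
        return Fingerprint.of(replicaIndex).equals(leaderFp);
    }
}
```

With a replica index of 1..4 and a leader index of 1..9, this sketch requests
5..9 and the two fingerprints then match.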

The reason for getting the fingerprint in step 2 is that we do not trust
{{fingerprint.maxVersionSpecified}}. Therefore we must use the leader's
fingerprint computed with {{fingerprint.maxVersionSpecified==Long.MAX_VALUE}}
(in other words, the fingerprint of the leader's index at the time of step 2).
We may need to block updates on the leader's side between getting the recent
versions and computing the fingerprint, but let's do that later.
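
A hedged sketch of what blocking updates on the leader could look like: the
recent versions and the fingerprint are computed inside one `synchronized`
method, so an `addUpdate` call cannot land between them. `LeaderSnapshotSketch`,
`addUpdate` and `recentVersionsAndFingerprint` are hypothetical names, and the
fingerprint is a simplified (max, count, sum) stand-in computed over the whole
index, as with `maxVersionSpecified == Long.MAX_VALUE`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

/**
 * Hypothetical leader-side model: recent versions and the fingerprint are
 * taken under the same lock, so they always describe the same index state.
 */
public class LeaderSnapshotSketch {

    /** Recent versions (newest first) plus a simplified fingerprint of the whole index. */
    public record Snapshot(List<Long> recentVersions, long maxVersionEncountered,
                           int numVersions, long versionsSum) {}

    private final TreeSet<Long> index = new TreeSet<>();

    /** Indexing an update waits while a snapshot is being taken. */
    public synchronized void addUpdate(long version) {
        index.add(version);
    }

    /** Returns up to nUpdates most recent versions and a fingerprint of the same state. */
    public synchronized Snapshot recentVersionsAndFingerprint(int nUpdates) {
        List<Long> recent = new ArrayList<>();
        Iterator<Long> it = index.descendingIterator();
        while (it.hasNext() && recent.size() < nUpdates) {
            recent.add(it.next());
        }
        long max = index.isEmpty() ? Long.MIN_VALUE : index.last();
        long sum = 0;
        for (long v : index) {
            sum += v;
        }
        return new Snapshot(recent, max, index.size(), sum);
    }
}
```

A single coarse lock is just the simplest way to get a consistent pair;
whether stalling indexing on the leader for that long is acceptable is a
separate question.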

By requesting updates up to {{fingerprint.maxVersionEncountered}}, we make sure
that after applying the updates, {{replica.maxVersionEncountered}} will equal
the leader's, hence the replica's fingerprint will be the same as the leader's.
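
The capping rule can be sketched as a filter: keep the leader versions the
replica lacks, but only up to the leader's `fingerprint.maxVersionEncountered`.
`MissedUpdatesSketch.missedUpTo` is a hypothetical helper, and versions are
plain longs rather than Solr's real version values.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class MissedUpdatesSketch {

    /**
     * Versions the leader reported that the replica lacks, capped at the
     * leader's fingerprint.maxVersionEncountered, in ascending order.
     */
    public static List<Long> missedUpTo(Set<Long> ourVersions, Collection<Long> leaderVersions,
                                        long leaderMaxVersionEncountered) {
        List<Long> missed = new ArrayList<>();
        for (long v : leaderVersions) {
            // Anything above the fingerprint's max is not covered by the fingerprint.
            if (v <= leaderMaxVersionEncountered && !ourVersions.contains(v)) {
                missed.add(v);
            }
        }
        Collections.sort(missed);
        return missed;
    }
}
```

For example, with a replica holding 1..4, leader versions 1..10 and a leader
`maxVersionEncountered` of 9 (version 10 assumed to have arrived after the
fingerprint was computed), the helper returns 5, 6, 7, 8, 9; after applying
them the replica's max version is 9, matching the leader's fingerprint.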

Another optimization is possible in step 3: instead of treating buffered
updates as missed updates, we just need to remember which buffered updates
must be applied in step 4.
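
The step-3 optimization amounts to splitting the work into two lists: versions
to fetch from the leader, and buffered versions to remember and replay in
step 4. `BufferedUpdatesPlanSketch`, `Plan` and `plan` are hypothetical names,
and which buffered updates to replay is an assumption here (those at or below
the leader's `maxVersionEncountered`).

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class BufferedUpdatesPlanSketch {

    /** Hypothetical result: what to fetch from the leader vs. replay from the buffer tlog. */
    public record Plan(List<Long> fetchFromLeader, List<Long> applyFromBuffer) {}

    public static Plan plan(Set<Long> indexVersions, Set<Long> bufferedVersions,
                            Collection<Long> leaderVersions, long leaderMaxVersionEncountered) {
        // Only versions that are neither indexed nor buffered need a round trip to the leader.
        List<Long> fetch = new ArrayList<>();
        for (long v : leaderVersions) {
            if (v <= leaderMaxVersionEncountered
                    && !indexVersions.contains(v) && !bufferedVersions.contains(v)) {
                fetch.add(v);
            }
        }
        // Assumption: buffered updates up to the leader's max are memoized for step 4.
        List<Long> fromBuffer = new ArrayList<>();
        for (long v : bufferedVersions) {
            if (v <= leaderMaxVersionEncountered) {
                fromBuffer.add(v);
            }
        }
        Collections.sort(fetch);
        Collections.sort(fromBuffer);
        return new Plan(fetch, fromBuffer);
    }
}
```

For the scenario in the issue description (index 1..4, buffered 7, 8, leader
versions 1..9, leader max 9), this fetches 5, 6, 9 from the leader and replays
7, 8 from the buffer.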





> Make PeerSync more robust
> -------------------------
>
>                 Key: SOLR-11216
>                 URL: https://issues.apache.org/jira/browse/SOLR-11216
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Cao Manh Dat
>            Priority: Major
>         Attachments: SOLR-11216.patch, SOLR-11216.patch, SOLR-11216.patch
>
>
> First of all, I will change the issue's title to a better name when I have one.
> While digging into SOLR-10126, I found a case that can make PeerSync fail.
> * leader and replica receive updates 1 to 4
> * replica stops
> * replica misses updates 5, 6
> * replica starts recovery
> ## replica buffers updates 7, 8
> ## replica requests versions from the leader
> ## at the same time, the leader receives update 9, so when the replica gets
> recent versions, the leader returns versions 1 to 9 (1,2,3,4,5,6,7,8,9)
> ## replica does PeerSync and requests updates 5, 6, 9 from the leader
> ## replica applies updates 5, 6, 9. Its index does not have updates 7, 8, and
> maxVersionSpecified for the fingerprint is 9, therefore the fingerprint
> comparison fails
> My question here is why the replica requests update 9 (step 6) when it knows
> that updates with lower versions (updates 7, 8) are in its buffering tlog.
> Should we request only updates lower than the lowest update in its buffering
> tlog (< 7)?
> Someone may ask: what if the replica doesn't receive update 9? In that case,
> the leader will put the replica into LIR state, so the replica will run the
> recovery process again.
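
The failure scenario and the proposed rule from the description above can be
replayed in a few lines. `BufferCapSketch` and its methods are hypothetical;
`fingerprintUpTo` stands in for a fingerprint computed with a given
`maxVersionSpecified` by simply returning the versions up to that bound.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

public class BufferCapSketch {

    /** Simplified stand-in for a fingerprint: the sorted versions <= maxVersionSpecified. */
    public static SortedSet<Long> fingerprintUpTo(Collection<Long> versions, long maxVersionSpecified) {
        SortedSet<Long> fp = new TreeSet<>();
        for (long v : versions) {
            if (v <= maxVersionSpecified) {
                fp.add(v);
            }
        }
        return fp;
    }

    /** Behaviour in the scenario: request every leader version not seen (indexed or buffered). */
    public static List<Long> requestAllUnseen(Set<Long> seenVersions, Collection<Long> leaderVersions) {
        List<Long> req = new ArrayList<>();
        for (long v : leaderVersions) {
            if (!seenVersions.contains(v)) {
                req.add(v);
            }
        }
        Collections.sort(req);
        return req;
    }

    /** Proposed rule: only request unseen versions below the lowest buffered version. */
    public static List<Long> requestBelowBuffer(Set<Long> seenVersions, SortedSet<Long> bufferedVersions,
                                                Collection<Long> leaderVersions) {
        long cap = bufferedVersions.isEmpty() ? Long.MAX_VALUE : bufferedVersions.first();
        List<Long> req = new ArrayList<>();
        for (long v : requestAllUnseen(seenVersions, leaderVersions)) {
            if (v < cap) {
                req.add(v);
            }
        }
        return req;
    }
}
```

With index 1..4, buffered 7, 8 and leader versions 1..9, `requestAllUnseen`
returns 5, 6, 9, after which the index (1..6, 9) differs from the leader's up to
version 9; `requestBelowBuffer` returns only 5, 6, and the index then matches
the leader up to version 6.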



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
