[
https://issues.apache.org/jira/browse/SOLR-8227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14995439#comment-14995439
]
Sandeep J commented on SOLR-8227:
---------------------------------
Very valid concerns/comments.
By design Solr is a CP system, so theoretically we should be able to recover
from any active replica. But making that happen without leaving the system in
an inconsistent state (i.e. without compromising 'C') is, in my opinion, an
implementation-level discussion.
The main problem we have seen in our production environment is that during peak
traffic, if a few nodes go into recovery and SnapPuller kicks in, we get an
'Oh Snap' moment :) What I am trying to say is that peer syncs are lightweight
compared to full replication.
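Just to make that contrast concrete, here is a rough sketch of how I picture
the current fallback. This is illustrative Java, not actual Solr code;
PeerSyncClient and ReplicationClient are made-up names standing in for the
real PeerSync / SnapPuller machinery.
{code:java}
// Rough sketch of the current recovery decision as I understand it.
// PeerSyncClient and ReplicationClient are hypothetical names, not Solr APIs.
public class RecoverySketch {

    interface PeerSyncClient {
        // returns true if the recovering node could catch up from the
        // peer's recent update-log entries alone
        boolean sync(String recoveringCoreUrl, String peerUrl);
    }

    interface ReplicationClient {
        // full index copy (SnapPuller-style): expensive for the source node
        void fetchIndex(String recoveringCoreUrl, String sourceUrl);
    }

    void recover(String coreUrl, String leaderUrl,
                 PeerSyncClient peerSync, ReplicationClient replication) {
        // cheap path first: exchange recent versions and replay the delta
        if (peerSync.sync(coreUrl, leaderUrl)) {
            return; // caught up without copying index files
        }
        // expensive path: copy hard-committed segment files from the leader.
        // This is the 'Oh Snap' moment when many replicas do it at once.
        replication.fetchIndex(coreUrl, leaderUrl);
    }
}
{code}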
I understand the coordination concept Yonik mentioned that goes on between the
leader and the recovering node, but that seems to apply to live updates,
doesn't it? Or am I missing something? In full replication, I believe index
files (i.e. hard-committed data) are copied from source to destination, so if
this heavy-duty operation can somehow be offloaded from the leader, it will
help.
Also, full replication can take minutes to complete depending on the size of
the index, traffic and network. In an environment where we have one leader and
20 replicas, all the recovering nodes go to the leader, which is also busy with
reads/writes. During this recovery window the leader can also change or go into
recovery itself, as mentioned by Tim. So even today, after full replication
from the leader, the recovered node should perform some kind of sanity check
with the latest leader, just to make sure that it has not missed any updates.
Piggybacking on Ishan's proposal:
If we add this sanity check or peer sync from the latest leader after the
replication, then even if we replicate from an active replica I think we should
be good. So here is the refined proposal (a rough sketch in code follows the
list):
1. The recovering node picks an active replica (it could be the leader).
2. After the peer sync, replication is started from the active replica found in
#1.
3. Once replication is complete, the recovering node gets the leader node from
the cluster state.
4. The recovering node performs a check with the leader.
5. If the number of missed updates is small enough that there is no need for
full recovery, the recovering node takes the updates from the leader.
6. If the number of missed updates is still large and it needs full recovery,
go back to step #1. The likelihood of this scenario is low, and it can be
reduced further by having a larger 'UpdateLog'. Ref:
https://issues.apache.org/jira/browse/SOLR-6359 .
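Here is the rough sketch of the flow above that I promised. All helper names
(ClusterStateView, RecoveryOps, countMissedUpdates, the PEER_SYNC_LIMIT
threshold, etc.) are hypothetical and only meant to show the control flow, not
real Solr APIs.
{code:java}
// Sketch of the proposed recovery flow (steps 1-6 above). Hypothetical names only.
public class ProposedRecoveryFlow {

    interface ClusterStateView {
        String findActiveReplica();  // step 1: any active replica (may be the leader)
        String currentLeader();      // step 3: leader as known right now
    }

    interface RecoveryOps {
        void peerSync(String fromUrl);               // lightweight catch-up
        void replicateIndex(String fromUrl);         // step 2: full index copy
        long countMissedUpdates(String leaderUrl);   // step 4: compare with the leader
        void fetchUpdates(String leaderUrl, long n); // step 5: replay a small delta
    }

    // assumed threshold; in practice it would depend on the updateLog size
    static final long PEER_SYNC_LIMIT = 100;

    void recover(ClusterStateView cluster, RecoveryOps ops) {
        while (true) {
            String source = cluster.findActiveReplica();   // step 1
            ops.peerSync(source);                          // lightweight catch-up first
            ops.replicateIndex(source);                    // step 2: copy from that replica
            String leader = cluster.currentLeader();       // step 3: leader may have changed
            long missed = ops.countMissedUpdates(leader);  // step 4: sanity check with leader
            if (missed <= PEER_SYNC_LIMIT) {
                if (missed > 0) {
                    ops.fetchUpdates(leader, missed);      // step 5: pull the small delta
                }
                return;                                    // caught up with the latest leader
            }
            // step 6: still too far behind -- retry from step 1 with another active replica
        }
    }
}
{code}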
Varun mentioned earlier that even during the recovery phase the node gets
updates from the leader, so the likelihood of running into #5 or #6 seems slim,
but it could happen with network partitions or long recovery times.
This proposal is based on my limited knowledge of how a Solr node identifies
that it has missed some updates; if someone can shed some light on that, it
would really help me understand how step #4 takes place.
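My guess is that it involves comparing recent update-log versions with the
leader, something roughly like the following. This is purely illustrative and
not the actual PeerSync logic; the method and parameter names are made up.
{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// A guess at how step #4 might work: compare the versions of the leader's
// most recent updates against what the recovering node already has.
public class MissedUpdateCheck {

    static long countMissed(List<Long> leaderRecentVersions, List<Long> myVersions) {
        Set<Long> mine = new HashSet<>(myVersions);
        long missed = 0;
        for (long v : leaderRecentVersions) {
            if (!mine.contains(v)) {
                missed++; // an update applied on the leader that we never saw
            }
        }
        return missed;
    }
}
{code}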
> Recovering replicas should be able to recover from any active replica
> ---------------------------------------------------------------------
>
> Key: SOLR-8227
> URL: https://issues.apache.org/jira/browse/SOLR-8227
> Project: Solr
> Issue Type: Improvement
> Reporter: Varun Thacker
>
> Currently when a replica goes into recovery it uses the leader to recover. It
> first tries to do a PeerSync. If that's not successful it does a
> replication. Most of the time it ends up doing a full replication because
> segment merging and autoCommits cause segments to be formed differently on the
> replicas (we should explore improving that in another issue).
> But when many replicas are recovering and hitting the leader, the leader can
> become a bottleneck. Since Solr is a CP system, we should be able to recover
> from any of the 'active' replicas instead of just the leader.