Timothy Potter created SOLR-5468:
------------------------------------

             Summary: Option to enforce a majority quorum approach to accepting updates in SolrCloud
                 Key: SOLR-5468
                 URL: https://issues.apache.org/jira/browse/SOLR-5468
             Project: Solr
          Issue Type: New Feature
          Components: SolrCloud
    Affects Versions: 4.5
         Environment: All
            Reporter: Timothy Potter
            Priority: Minor


I've been thinking about how SolrCloud deals with write-availability using 
in-sync replica sets, in which writes will continue to be accepted so long as 
there is at least one healthy node per shard.

For a little background (and to verify my understanding of the process is 
correct), SolrCloud only considers active/healthy replicas when acknowledging a 
write. Specifically, when a shard leader accepts an update request, it forwards 
the request to all active/healthy replicas and only considers the write 
successful if all active/healthy replicas ack the write. Any down / gone 
replicas are not considered and will sync up with the leader when they come 
back online using peer sync or snapshot replication. For instance, if a shard 
has 3 nodes, A, B, C with A being the current leader, then writes to the shard 
will continue to succeed even if B & C are down.
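
To make my understanding of the current rule concrete, here is a rough sketch 
in Python. It is a toy model, not Solr code: update_succeeds and the 
"active"/"down" state strings are names I made up for illustration.

```python
def update_succeeds(replica_states, acks):
    """Toy model of the ack rule described above (not a Solr API).

    replica_states: non-leader replica name -> "active" or "down"
    acks: replica name -> True/False for replicas the leader forwarded to
    """
    # Only active replicas are consulted; down replicas are ignored entirely.
    active = [name for name, state in replica_states.items() if state == "active"]
    # The write succeeds only if every active replica acked it.
    # With no active replicas, all() over an empty list is True.
    return all(acks.get(name, False) for name in active)


# Leader A with B and C down: there is nobody to wait for, so the write succeeds.
both_down = update_succeeds({"B": "down", "C": "down"}, {})

# B is active but fails to ack: the write is not considered successful.
one_nack = update_succeeds({"B": "active", "C": "down"}, {"B": False})
```

The empty-list case is the one I care about: with no active replicas, the 
leader's own copy is the only one that has the update.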

The issue is that if a shard leader continues to accept updates even if it 
loses all of its replicas, then we have acknowledged updates on only 1 node. If 
that node, call it A, then fails and one of the previous replicas, call it B, 
comes back online before A does, then any writes that A accepted while the 
other replicas were offline are at risk of being lost.
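
The sequence above can be walked through with a small in-memory model. This is 
only an illustration under my own assumptions: each node's index is a set of 
doc ids, and write() is a hypothetical helper, not anything in Solr.

```python
# Toy model: each node's index is just a set of document ids.
index = {"A": {1, 2}, "B": {1, 2}, "C": {1, 2}}
up = {"A", "B", "C"}


def write(leader, doc_id):
    """The leader applies the update and forwards it to every node that is up."""
    assert leader in up, "leader must be online to accept a write"
    for node in up:
        index[node].add(doc_id)


# B and C go down; A keeps accepting writes on its own.
up -= {"B", "C"}
write("A", 3)
write("A", 4)

# A fails, then B comes back first and would take over as leader.
up = {"B"}

# These documents were acknowledged but exist only on the failed node A.
at_risk = index["A"] - index["B"]
```

Here at_risk holds docs 3 and 4, the updates A accepted while it was the only 
node online.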

SolrCloud does provide a safeguard for this problem in the leaderVoteWait 
setting, which puts any replicas that come back online before node A into a 
temporary wait state. If A comes back online within the wait period, all is 
well: it becomes the leader again and no writes are lost. As a side note, sys 
admins definitely need to be made more aware of this situation; when I first 
encountered it in my cluster, I had no idea what it meant.

My question is whether we want to consider an approach where SolrCloud will not 
accept writes unless a majority of a shard's replicas is available to accept 
the write. For my example, under this approach, we wouldn't accept writes if 
both B & C failed, but would if only C did, leaving A & B online. Admittedly, 
this lowers the write-availability of the system, so it may be something that 
should be tunable.
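
As a minimal sketch of the check I have in mind (has_majority is a 
hypothetical name, and I am counting the leader as one of the shard's 
replicas):

```python
def has_majority(total_replicas, live_replicas):
    """True if strictly more than half of the shard's replicas are live.

    Both counts include the leader. Hypothetical helper, not a Solr API.
    """
    return live_replicas > total_replicas // 2


# Shard with 3 nodes (A, B, C):
only_a_up = has_majority(3, 1)     # B and C down -> reject writes
a_and_b_up = has_majority(3, 2)    # only C down  -> accept writes
```

With an even replica count, exactly half is not a majority under this rule, so 
a 4-node shard with 2 live nodes would also reject writes.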

>From Mark M: Yeah, this is kind of like one of many little features that we 
>have just not gotten to yet. I’ve always planned for a param that lets you 
>say how many replicas an update must be verified on before responding success. 
>Seems to make sense to fail that type of request early if you notice there are 
>not enough replicas up to satisfy the param to begin with.
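
A rough sketch of what Mark describes might look like the following. 
Everything here is an assumption for illustration: min_replicas is a made-up 
parameter name, send stands in for forwarding the update to one replica, and 
NotEnoughReplicas is a hypothetical error type. None of it is an existing Solr 
API.

```python
class NotEnoughReplicas(Exception):
    """Raised before any work is done when the requirement cannot be met."""


def process_update(live_replicas, min_replicas, send):
    """Apply an update that must be verified on at least min_replicas nodes.

    live_replicas: names of replicas currently up (leader included)
    min_replicas:  hypothetical per-request parameter
    send:          callable(name) -> bool, True if that replica acked
    """
    # Fail early: no point attempting the write if it cannot possibly succeed.
    if len(live_replicas) < min_replicas:
        raise NotEnoughReplicas(
            f"need {min_replicas} replicas, only {len(live_replicas)} up"
        )
    # Otherwise forward to every live replica and count the acks.
    acked = sum(1 for name in live_replicas if send(name))
    return acked >= min_replicas


# Two of three nodes up and both ack: the requirement of 2 is satisfied.
ok = process_update(["A", "B"], 2, lambda name: True)
```

Setting min_replicas to a majority of the shard gives the quorum behavior 
above, while setting it to 1 keeps today's behavior, which is one way to make 
the trade-off tunable.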



