[ 
https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Potter updated SOLR-5468:
---------------------------------

    Attachment: SOLR-5468.patch

Here is a patch that should be considered exploratory / a discovery pass on this 
topic. It has a little overlap with the patch I posted for SOLR-5495, but that's 
not a concern, as I plan to commit SOLR-5495 first, so the overlap will get 
worked out shortly.

As I dug into this idea in more detail, it became pretty clear that what we can 
accomplish in the area of providing stronger enforcement of replication during 
update processing is fairly limited by our architecture. Of course this is not 
a criticism of the current architecture as I feel it’s fundamentally sound.

The underlying concept in this feature is that a client application wants to 
ensure a write succeeds on more than just the leader. For instance, in a 
collection with RF=3, the client may want an update request to be considered 
successful only if it succeeds on 2 of the 3 replicas, whereas today an update 
request is considered successful if it succeeds on the leader alone. The 
problem is that there’s no consistent way to “back out” an update without some 
sort of distributed transaction coordination among the replicas, and I’m pretty 
sure we don’t want to go down that road. Backing out an update seems doable 
(potentially) if we’re just talking about the leader, but what happens when the 
client wants minRF=3 and the update only works on the leader and one of the 
replicas? Now we need to back out an update from both the leader and one of the 
replicas. Gets ugly fast …

So what is accomplished in this patch? First, a client application can request 
information back from the cluster on what replication factor was achieved for 
an update request by sending the min_rf parameter in the request. This hints to 
the leader to keep track of the success or failure of the request on each 
replica. Since that implies some waiting to see the results, the client can 
also send the max_rfwait parameter, which tells the leader how long it should 
wait to collect the results from the replicas (default is 5 seconds). This is 
captured in the ReplicationFactorTest class.
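To make the leader-side behavior concrete, here is a minimal sketch of the wait 
described above. This is illustrative only: all names are hypothetical and none 
of these classes come from the patch. The leader counts its own local write, 
then waits up to max_rfwait for each replica's ack and reports the replication 
factor it actually observed.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of the leader-side wait: the leader forwards the
// update, then blocks up to maxRfWaitMs collecting replica acks and
// reports the replication factor actually achieved.
public class AchievedRfSketch {
    static int achievedRf(List<CompletableFuture<Boolean>> replicaAcks,
                          long maxRfWaitMs) {
        int rf = 1; // the local write on the leader already succeeded
        long deadline = System.currentTimeMillis() + maxRfWaitMs;
        for (CompletableFuture<Boolean> ack : replicaAcks) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) break; // stop waiting once max_rfwait elapses
            try {
                if (ack.get(remaining, TimeUnit.MILLISECONDS)) rf++;
            } catch (TimeoutException | InterruptedException | ExecutionException e) {
                // a slow or failed replica simply doesn't count toward rf
            }
        }
        return rf;
    }

    public static void main(String[] args) {
        // Two replicas: one acks the write, one reports failure.
        List<CompletableFuture<Boolean>> acks = List.of(
            CompletableFuture.completedFuture(true),
            CompletableFuture.completedFuture(false));
        System.out.println(achievedRf(acks, 5000)); // prints 2
    }
}
```

Note the response reports the achieved rf rather than failing the request, for 
the backing-out reasons discussed above.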

This can be useful for client applications that have idempotent updates and 
can thus decide to retry an update if the desired replication factor was not 
achieved. What we can’t do is fail the request if the desired min_rf is not 
achieved, as that leads to the aforementioned backing-out issues. There is one 
case where we can fail the request and avoid the backing-out issue: if we know 
the min_rf can’t be achieved before we do the write locally on the leader. This 
patch doesn’t have that solution in place, as I wasn’t sure if it’s desired; if 
so, it will be easy to add. Also, this patch doesn’t have anything in place for 
batch processing, i.e., it only works with single update requests, as I wanted 
to get some feedback before going down that path any further. Moreover, there’s 
a pretty high cost in terms of slowing down update request processing in 
SolrCloud by having the leader block until it knows the result of the request 
on the replicas. In other words, this functionality is not free, but it may 
still be useful for some applications.
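For illustration, a client-side retry loop of the kind described above might 
look like the following sketch. To be clear about assumptions: sendUpdate is a 
stand-in for whatever call submits the update and returns the achieved 
replication factor; it is not a real SolrJ API, and the whole loop only makes 
sense because the update is idempotent (resending it is safe).

```java
import java.util.function.IntSupplier;

// Hypothetical client-side retry loop for idempotent updates: resend the
// same update until the reported replication factor reaches the desired
// min_rf or the retry budget is exhausted.
public class MinRfRetrySketch {
    static boolean updateWithMinRf(IntSupplier sendUpdate, int minRf, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            int achievedRf = sendUpdate.getAsInt(); // resend is safe: update is idempotent
            if (achievedRf >= minRf) return true;
        }
        return false; // caller decides what to do next: alert, queue for later, etc.
    }

    public static void main(String[] args) {
        // Simulate a cluster that achieves rf=1 twice, then rf=2.
        int[] results = {1, 1, 2};
        int[] i = {0};
        boolean ok = updateWithMinRf(() -> results[i[0]++], 2, 3);
        System.out.println(ok); // prints true after the third attempt
    }
}
```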

To me, at the end of the day, what’s really needed is to ensure that any update 
requests that were ack’d back to the client are not lost. This could happen 
under the current architecture if the leader is the only replica that has a 
write, then fails and doesn’t recover before another replica recovers and 
assumes the leadership role (after waiting too long to see the previous leader 
come back). Thus, from where I sit, our efforts are better spent on continuing 
to harden the leader failover and recovery processes, and applications needing 
stronger guarantees should run more replicas. SolrCloud should then focus on 
only allowing in-sync replicas to become the leader, using strategies like the 
one provided with SOLR-5495.

> Option to enforce a majority quorum approach to accepting updates in SolrCloud
> ------------------------------------------------------------------------------
>
>                 Key: SOLR-5468
>                 URL: https://issues.apache.org/jira/browse/SOLR-5468
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>    Affects Versions: 4.5
>         Environment: All
>            Reporter: Timothy Potter
>            Assignee: Timothy Potter
>            Priority: Minor
>         Attachments: SOLR-5468.patch
>
>
> I've been thinking about how SolrCloud deals with write-availability using 
> in-sync replica sets, in which writes will continue to be accepted so long as 
> there is at least one healthy node per shard.
> For a little background (and to verify my understanding of the process is 
> correct), SolrCloud only considers active/healthy replicas when acknowledging 
> a write. Specifically, when a shard leader accepts an update request, it 
> forwards the request to all active/healthy replicas and only considers the 
> write successful if all active/healthy replicas ack the write. Any down / 
> gone replicas are not considered and will sync up with the leader when they 
> come back online using peer sync or snapshot replication. For instance, if a 
> shard has 3 nodes, A, B, C with A being the current leader, then writes to 
> the shard will continue to succeed even if B & C are down.
> The issue is that if a shard leader continues to accept updates even if it 
> loses all of its replicas, then we have acknowledged updates on only 1 node. 
> If that node, call it A, then fails and one of the previous replicas, call it 
> B, comes back online before A does, then any writes that A accepted while the 
> other replicas were offline are at risk of being lost. 
> SolrCloud does provide a safeguard mechanism for this problem with the 
> leaderVoteWait setting, which puts any replicas that come back online before 
> node A into a temporary wait state. If A comes back online within the wait 
> period, then all is well as it will become the leader again and no writes 
> will be lost. As a side note, sys admins definitely need to be made more 
> aware of this situation as when I first encountered it in my cluster, I had 
> no idea what it meant.
> My question is whether we want to consider an approach where SolrCloud will 
> not accept writes unless there is a majority of replicas available to accept 
> the write? For my example, under this approach, we wouldn't accept writes if 
> both B&C failed, but would if only C did, leaving A & B online. Admittedly, 
> this lowers the write-availability of the system, so may be something that 
> should be tunable?
> From Mark M: Yeah, this is kind of like one of many little features that we 
> have just not gotten to yet. I’ve always planned for a param that lets you 
> say how many replicas an update must be verified on before responding 
> success. Seems to make sense to fail that type of request early if you notice 
> there are not enough replicas up to satisfy the param to begin with.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
