[
https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Timothy Potter updated SOLR-5468:
---------------------------------
Attachment: SOLR-5468.patch
Here is a patch that should be considered exploratory / a discovery pass at
this topic. It has a little overlap with the patch I posted for SOLR-5495, but
that's not a concern as I plan to commit SOLR-5495 first, so the overlap will
get worked out shortly.
As I dug into this idea in more detail, it became pretty clear that what we can
accomplish in terms of stronger enforcement of replication during update
processing is fairly limited by our architecture. Of course, this is not a
criticism of the current architecture, as I feel it's fundamentally sound.
The underlying concept of this feature is that a client application wants to
ensure a write succeeds on more than just the leader. For instance, in a
collection with RF=3, the client may want to say that an update request should
not be considered successful unless it succeeds on 2 of the 3 replicas, whereas
today an update request is considered successful if it succeeds on the leader
alone.
The problem is that there's no consistent way to "back out" an update without
some sort of distributed transaction coordination among the replicas, and I'm
pretty sure we don't want to go down that road. Backing out an update seems
(potentially) doable if we're just talking about the leader, but what happens
when the client wants min_rf=3 and the update only works on the leader and one
of the replicas? Now we need to back out an update from the leader and one of
the replicas. Gets ugly fast …
So what is accomplished in this patch? First, a client application has the
ability to request information back from the cluster about what replication
factor was achieved for an update request by sending the min_rf parameter in
the request. This is a hint to the leader to keep track of the success or
failure of the request on each replica. As that implies some waiting to see the
results, the client can also send the max_rfwait parameter, which tells the
leader how long it should wait to collect the results from the replicas
(default is 5 seconds). This is captured in the ReplicationFactorTest class,
and a client-side sketch follows below.
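
To make the client-side usage concrete, here is a minimal SolrJ sketch of
sending these parameters and reading back the achieved replication factor. It
reflects my reading of the patch; the "rf" response key and the units of
max_rfwait are assumptions, so check ReplicationFactorTest for the actual names
and response structure.

{code:java}
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.NamedList;

public class MinRfClientExample {
  public static void main(String[] args) throws Exception {
    // Connect through ZooKeeper; the address is just an example.
    CloudSolrServer server = new CloudSolrServer("localhost:9983");
    server.setDefaultCollection("collection1");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");

    UpdateRequest req = new UpdateRequest();
    req.add(doc);
    // Ask the leader to track per-replica success/failure for this update.
    req.setParam("min_rf", "2");
    // How long the leader waits to collect results (assumed to be millis).
    req.setParam("max_rfwait", "5000");

    // The leader reports the achieved replication factor back in the
    // response; "rf" is an assumed key.
    NamedList<Object> rsp = server.request(req);
    System.out.println("achieved rf: " + rsp.get("rf"));

    server.shutdown();
  }
}
{code}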
This can be useful for client applications that have idempotent updates and can
thus decide to retry an update if the desired replication factor was not
achieved (see the retry sketch below). What we can't do is fail the request if
the desired min_rf is not achieved, as that leads to the aforementioned
backing-out issues. There is one case where we can fail the request and avoid
the backing-out issue: if we know the min_rf can't be achieved before we do the
write locally on the leader. This patch doesn't have that solution in place as
I wasn't sure if that's desired; if so, it will be easy to add (a rough sketch
of such a pre-check also follows below). Also, this patch doesn't have anything
in place for batch processing, i.e. it only works with single update requests,
as I wanted to get some feedback before going down that path any further.
Moreover, there's a pretty high cost in terms of slowing down update request
processing in SolrCloud, because the leader must block until it knows the
result of the request on the replicas. In other words, this functionality is
not free, but it may still be useful for some applications.
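
For the retry case mentioned above, the client-side pattern would be roughly
the following. It is only safe because the updates are idempotent, i.e.
re-sending an update that already succeeded on some replicas does no harm; the
"rf" response key is again an assumption on my part.

{code:java}
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.util.NamedList;

public class IdempotentRetry {
  /**
   * Re-sends an idempotent update until the reported replication factor
   * meets desiredRf or the attempts are exhausted. Returns true on success.
   */
  static boolean sendWithRetries(SolrServer server, UpdateRequest req,
                                 int desiredRf, int maxAttempts)
      throws Exception {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      NamedList<Object> rsp = server.request(req);
      Number achieved = (Number) rsp.get("rf"); // assumed response key
      if (achieved != null && achieved.intValue() >= desiredRf) {
        return true; // enough replicas acknowledged the write
      }
      // Back off so downed replicas get a chance to recover before retrying.
      Thread.sleep(1000L * attempt);
    }
    return false; // caller decides what to do if min_rf was never reached
  }
}
{code}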
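And for the fail-early case, which this patch does not implement, the pre-check
could look something like the sketch below: count the shard's replicas that are
active and hosted on live nodes, and reject the request before the local write
if min_rf is unreachable. This is purely illustrative; the class and method
names are mine, and the wiring into DistributedUpdateProcessor is left out.

{code:java}
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class MinRfPreCheck {
  /**
   * Counts replicas of a shard that are active and hosted on live nodes.
   * A leader could compare this against min_rf before doing its local write
   * and fail the request early if min_rf is unreachable.
   */
  static int countActiveReplicas(ClusterState clusterState,
                                 String collection, String shard) {
    Slice slice = clusterState.getSlice(collection, shard);
    if (slice == null) return 0;
    int active = 0;
    for (Replica replica : slice.getReplicas()) {
      String state = replica.getStr(ZkStateReader.STATE_PROP);
      String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
      if (ZkStateReader.ACTIVE.equals(state)
          && clusterState.liveNodesContain(nodeName)) {
        active++;
      }
    }
    return active;
  }
}
{code}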
To me, at the end of the day, what's really needed is to ensure that any update
requests that were ack'd back to the client are not lost. This could happen
under the current architecture if the leader is the only replica that has a
write and then fails and doesn't recover before another replica recovers and
assumes the leadership role (after the leaderVoteWait period expires without
the previous leader coming back). Thus, from where I sit, our efforts are
better spent on continuing to harden the leader failover and recovery
processes, and applications needing stronger guarantees should run more
replicas. SolrCloud should then focus on only allowing in-sync replicas to
become the leader, using strategies like what was provided with SOLR-5495.
> Option to enforce a majority quorum approach to accepting updates in SolrCloud
> ------------------------------------------------------------------------------
>
> Key: SOLR-5468
> URL: https://issues.apache.org/jira/browse/SOLR-5468
> Project: Solr
> Issue Type: New Feature
> Components: SolrCloud
> Affects Versions: 4.5
> Environment: All
> Reporter: Timothy Potter
> Assignee: Timothy Potter
> Priority: Minor
> Attachments: SOLR-5468.patch
>
>
> I've been thinking about how SolrCloud deals with write-availability using
> in-sync replica sets, in which writes will continue to be accepted so long as
> there is at least one healthy node per shard.
> For a little background (and to verify my understanding of the process is
> correct), SolrCloud only considers active/healthy replicas when acknowledging
> a write. Specifically, when a shard leader accepts an update request, it
> forwards the request to all active/healthy replicas and only considers the
> write successful if all active/healthy replicas ack the write. Any down /
> gone replicas are not considered and will sync up with the leader when they
> come back online using peer sync or snapshot replication. For instance, if a
> shard has 3 nodes, A, B, C with A being the current leader, then writes to
> the shard will continue to succeed even if B & C are down.
> The issue is that if a shard leader continues to accept updates even if it
> loses all of its replicas, then we have acknowledged updates on only 1 node.
> If that node, call it A, then fails and one of the previous replicas, call it
> B, comes back online before A does, then any writes that A accepted while the
> other replicas were offline are at risk of being lost.
> SolrCloud does provide a safe-guard mechanism for this problem with the
> leaderVoteWait setting, which puts any replicas that come back online before
> node A into a temporary wait state. If A comes back online within the wait
> period, then all is well as it will become the leader again and no writes
> will be lost. As a side note, sys admins definitely need to be made more
> aware of this situation; when I first encountered it in my cluster, I had
> no idea what it meant.
> My question is whether we want to consider an approach where SolrCloud will
> not accept writes unless a majority of replicas is available to accept the
> write. For my example, under this approach, we wouldn't accept writes if
> both B & C failed, but would if only C did, leaving A & B online. Admittedly,
> this lowers the write-availability of the system, so may be something that
> should be tunable?
> From Mark M: Yeah, this is kind of like one of many little features that we
> have just not gotten to yet. I’ve always planned for a param that lets you
> say how many replicas an update must be verified on before responding
> success. Seems to make sense to fail that type of request early if you notice
> there are not enough replicas up to satisfy the param to begin with.