[ 
https://issues.apache.org/jira/browse/KUDU-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911703#comment-16911703
 ] 

Adar Dembo commented on KUDU-2924:
----------------------------------

Note: for the "separate bug" to trigger, there also has to be a roll of the WAL 
between ops L and M. Otherwise anchoring M is equivalent to anchoring L in that 
either will prevent GC'ing the WAL segment that contains both L and M. This 
requires a high rate of ingest, large write batches (perhaps many columns), and 
unfortunate timing.

> Let newly rereplicated replicas try to catch up before evicting them
> --------------------------------------------------------------------
>
>                 Key: KUDU-2924
>                 URL: https://issues.apache.org/jira/browse/KUDU-2924
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus, tablet copy
>    Affects Versions: 1.11.0
>            Reporter: Adar Dembo
>            Priority: Major
>
> In heavily loaded clusters with a high rate of ingest, laggy FOLLOWER 
> eviction can lead to unsatisfiable tablet copy loops. This plays out 
> something like this:
> # Replication group containing replicas A, B, C. A is the leader.
> # Due to load, C starts to lag behind A.
> # Eventually, C is evicted.
> # A new replica D is added elsewhere and tablet copy begins from A. It's 
> going to copy WAL ops M..N, where M is the oldest op not yet flushed, and N 
> is the most recent op written.
> # Due to a separate bug (detailed below), A actually thinks D needs ops L..N 
> where L is close to but a bit before M.
> # More and more data is written to A and replicated to B. The op index 
> eventually climbs up to O, where segment(O) - segment(M) exceeds the maximum 
> number of segments to retain.
> # A GCs all ops up to M, including L. D can no longer catch up and is 
> evicted, even before the tablet copy is finished.
> # A new replica E is added and tablet copy begins from A. The cycle repeats.
> Even if that separate bug is fixed, A will release its anchor on ops M..N 
> when D finishes copying, which means D will still be evicted before it has a 
> chance to catch up.
> Why does this matter? Isn't it "correct" that D can't catch up and thus 
> should be evicted? Well, yes, but we've just spent a bunch of cluster 
> resources on a tablet copy that amounted to nothing useful. We should try to 
> get our money's worth first by giving D one "free" catch-up: don't evict D 
> unless it falls behind _after catching up to O_, or if some timer expires.
> The aforementioned separate bug: the addition of D and its tablet copy are 
> two separate events. When D is added, we use a conservative estimate to 
> figure out what op it should have:
> {noformat}
>   // We don't know the last operation received by the peer so, following the
>   // Raft protocol, we set next_index to one past the end of our own log. This
>   // way, if calling this method is the result of a successful leader election
>   // and the logs between the new leader and remote peer match, the
>   // peer->next_index will point to the index of the soon-to-be-written NO_OP
>   // entry that is used to assert leadership. If we guessed wrong, and the 
> peer
>   // does not have a log that matches ours, the normal queue negotiation
>   // process will eventually find the right point to resume from.
>   tracked_peer->next_index = queue_state_.last_appended.index() + 1;
> {noformat}
> When the tablet copy begins, A anchors to the last op in its WAL. If the 
> tablet copy starts after the addition of D, {{tracked_peer->next_index}} will 
> be too conservative, and even though all the necessary ops will be copied to 
> D, A may evict D if {{tracked_peer->next_index}} is GC'ed.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

Reply via email to