[jira] [Created] (KUDU-2924) Let newly rereplicated replicas try to catch up before evicting them

Adar Dembo (Jira) Tue, 20 Aug 2019 13:00:37 -0700

Adar Dembo created KUDU-2924:
--------------------------------

             Summary: Let newly rereplicated replicas try to catch up before 
evicting them
                 Key: KUDU-2924
                 URL: https://issues.apache.org/jira/browse/KUDU-2924
             Project: Kudu
          Issue Type: Bug
          Components: consensus, tablet copy
    Affects Versions: 1.11.0
            Reporter: Adar Dembo



In heavily loaded clusters with a high rate of ingest, laggy FOLLOWER eviction 
can lead to unsatisfiable tablet copy loops. This plays out something like this:
# Replication group containing replicas A, B, C. A is the leader.
# Due to load, C starts to lag behind A.
# Eventually, C is evicted.
# A new replica D is added elsewhere and tablet copy begins from A. It's going 
to copy WAL ops M..N, where M is the oldest op not yet flushed, and N is the 
most recent op written.
# Due to a separate bug (detailed below), A actually thinks D needs ops L..N 
where L is close to but a bit before M.
# More and more data is written to A and replicated to B. The op index 
eventually climbs up to O, where segment(O) - segment(M) exceeds the maximum 
number of segments to retain.
# A GCs all ops up to M, including L. D can no longer catch up and is evicted, 
even before the tablet copy is finished.
# A new replica E is added and tablet copy begins from A. The cycle repeats.

Even if that separate bug is fixed, A will release its anchor on ops M..N when 
D finishes copying, which means D will still be evicted before it has a chance 
to catch up.

Why does this matter? Isn't it "correct" that D can't catch up and thus should 
be evicted? Well, yes, but we've just spent a bunch of cluster resources on a 
tablet copy that amounted to nothing useful. We should try to get our money's 
worth first by giving D one "free" catch-up: don't evict D unless it falls 
behind _after catching up to O_, or if some timer expires.

The aforementioned separate bug: the addition of D and its tablet copy are two 
separate events. When D is added, we use a conservative estimate to figure out 
what op it should have:
{noformat}
  // We don't know the last operation received by the peer so, following the
  // Raft protocol, we set next_index to one past the end of our own log. This
  // way, if calling this method is the result of a successful leader election
  // and the logs between the new leader and remote peer match, the
  // peer->next_index will point to the index of the soon-to-be-written NO_OP
  // entry that is used to assert leadership. If we guessed wrong, and the peer
  // does not have a log that matches ours, the normal queue negotiation
  // process will eventually find the right point to resume from.
  tracked_peer->next_index = queue_state_.last_appended.index() + 1;
{noformat}

When the tablet copy begins, A anchors to the last op in its WAL. If the tablet 
copy starts after the addition of D, {{tracked_peer->next_index}} will be too 
conservative, and even though all the necessary ops will be copied to D, A may 
evict D if {{tracked_peer->next_index}} is GC'ed.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

[jira] [Created] (KUDU-2924) Let newly rereplicated replicas try to catch up before evicting them

Reply via email to