[ https://issues.apache.org/jira/browse/KUDU-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280703#comment-15280703 ]

Todd Lipcon commented on KUDU-1408:
-----------------------------------

Trying to think through a couple of ideas to solve this issue... a few thoughts 
follow:

*Maybe we should change consensus to anchor the log as far back as its 
farthest-behind follower?*
In the case of a follower that has been added but not yet successfully 
replicated to, we would consider its position "infinitely in the past" and 
essentially disable WAL GC.

Upsides:
- definitely would solve this problem

Downsides:
- we currently see it as a sort of feature that if one of the replicas isn't 
keeping up, we evict it and replace it with a new replica. This is good for 
countering "limplock" (http://ucare.cs.uchicago.edu/pdf/socc13-limplock.pdf) 
issues. This change would break that: we'd never evict a replica that was 
'limping along'; instead it would just get farther and farther behind. We could 
certainly design around this with various heuristics to estimate whether a 
replica is getting farther behind or catching up, but it's not trivial.
- there's a bit of hidden complexity here: only the leader currently knows how 
far behind the other replicas are. So, if there's a leader re-election, the new 
leader is likely to have already GC'd the old logs, in which case we're back to 
the same issue. So, for this to work, the leader would probably need to 
propagate the anchor along with its heartbeats (a rough sketch follows this 
list).
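
For concreteness, a rough sketch of what the leader-side computation might look 
like (the PeerState / ComputeLogAnchorIndex names are invented for illustration, 
not Kudu's actual consensus classes); the resulting index would then have to 
ride along on the leader's heartbeats so that a newly elected leader inherits 
the same GC floor:

{code}
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// A peer as tracked by the leader. Illustrative placeholder, not Kudu's
// actual consensus bookkeeping.
struct PeerState {
  int64_t last_replicated_index;  // 0 => nothing successfully replicated yet
};

// Index of the earliest op the leader must retain: no segment containing an
// op at or after this index may be GC'd. A freshly added peer (index 0) pins
// the whole log, i.e. effectively disables WAL GC until it catches up.
int64_t ComputeLogAnchorIndex(const std::vector<PeerState>& peers) {
  int64_t anchor = std::numeric_limits<int64_t>::max();
  for (const PeerState& p : peers) {
    anchor = std::min(anchor, p.last_replicated_index);
  }
  return anchor;  // max() when there are no peers, i.e. no extra retention
}
{code}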

*Would a "grace period" of retention after a remote bootstrap be sufficient?*

For example, after remote bootstrapping a new node, should we just hold the 
remote bootstrap session's log anchor for some number of minutes before 
releasing it? This has the same issue noted above where only the remote 
bootstrap "source" node would hold the anchor, and thus wouldn't work if there 
were any leader election during the catch-up time.
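
A minimal sketch of that grace-period idea, assuming hypothetical LogAnchor / 
LogAnchorRegistry stand-ins for the source tserver's anchor bookkeeping (this 
still leaves the single-holder problem above unsolved):

{code}
#include <chrono>
#include <cstdint>
#include <thread>

// Illustrative placeholders, not Kudu's actual log anchoring API.
struct LogAnchor { int64_t anchored_index; };

class LogAnchorRegistry {
 public:
  void Release(LogAnchor* anchor) { /* unpin the anchored segments */ }
};

// Instead of releasing the anchor the moment the remote bootstrap session
// ends, hold it for a grace period so the new replica has a window to open
// the tablet and start catching up before the source may GC those segments.
void ReleaseAnchorAfterGracePeriod(LogAnchorRegistry* registry,
                                   LogAnchor* anchor,
                                   std::chrono::minutes grace_period) {
  std::thread([registry, anchor, grace_period] {
    std::this_thread::sleep_for(grace_period);
    registry->Release(anchor);
  }).detach();
}
{code}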

*What if "catch up" is impossible? Do we need to actually slow down the 
majority to let the new follower catch up?*

In the case that the tablet is handling the "max throughput" of writes, it 
seems quite plausible that a follower trying to catch up can't process the 
writes any faster than they were originally processed by the two "good" nodes. 
You could expect the follower to be slightly faster in most cases because it's 
receiving the writes in large batches, but in some cases it might also be 
slower because its caches will be cold compared to the active replicas.
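
To make that concrete, a back-of-the-envelope sketch (all numbers invented for 
illustration): even a follower that applies 10% faster than the majority's 
sustained rate needs hours to close a large gap, and if its rate ever drops to 
or below the leader's, it never catches up at all.

{code}
#include <cstdio>

int main() {
  // Illustrative numbers only: the majority sustains 10k writes/sec, the new
  // follower applies 11k ops/sec (big batches, but cold caches), and it is
  // ~12M ops behind by the time the tablet copy finishes.
  const double leader_rate = 10000.0;
  const double follower_rate = 11000.0;
  const double lag_ops = 1.2e7;

  if (follower_rate <= leader_rate) {
    std::printf("catch-up never completes; the lag only grows\n");
  } else {
    std::printf("catch-up takes about %.0f seconds\n",      // ~12000s (~3.3h)
                lag_ops / (follower_rate - leader_rate));
  }
  return 0;
}
{code}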

----

It would be interesting to look into how some other systems handle this sort of 
issue: both consensus-based systems and async log-shipping replication systems 
probably have the same problem, where a follower may be unable to process 
operations at the same rate as the leader or the majority.

> Adding a replica may never succeed if copying tablet takes longer than the 
> log retention time
> ---------------------------------------------------------------------------------------------
>
>                 Key: KUDU-1408
>                 URL: https://issues.apache.org/jira/browse/KUDU-1408
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus, tserver
>    Affects Versions: 0.8.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>
> Currently, while a remote bootstrap session is in progress, we anchor the 
> logs from the time at which it started. However, as soon as the session 
> finishes, we drop the anchor, and delete any logs. In the case where the 
> tablet copy itself takes longer than the log retention period, this means 
> it's likely to have a scenario like:
> - TS A starts downloading from TS B. It plans to download segments 1-4 and 
> adds an anchor.
> - TS B handles writes for 20 minutes, rolling the log many times (e.g. up to 
> log segment 20)
> - TS A finishes downloading, and ends the remote bootstrap session
> - TS B no longer has an anchor, so GCs all logs 1-16.
> - TS A finishes opening the tablet it just copied, but immediately is unable 
> to catch up (because it only has segments 1-4, but the leader only has 17-20)
> - TS B evicts TS A
> This loop will go on basically forever until the write workload stops on TS B.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
