Todd Lipcon created KUDU-1408:
---------------------------------

             Summary: Adding a replica may never succeed if copying tablet 
takes longer than the log retention time
                 Key: KUDU-1408
                 URL: https://issues.apache.org/jira/browse/KUDU-1408
             Project: Kudu
          Issue Type: Bug
          Components: consensus, tserver
    Affects Versions: 0.8.0
            Reporter: Todd Lipcon
            Priority: Critical


Currently, while a remote bootstrap session is in progress, we anchor the logs 
from the time at which it started. However, as soon as the session finishes, we 
drop the anchor, and any logs it was holding become eligible for GC. If the 
tablet copy itself takes longer than the log retention period, a scenario like 
the following is likely:

- TS A starts downloading from TS B. It plans to download segments 1-4 and adds 
an anchor.
- TS B handles writes for 20 minutes, rolling the log many times (e.g. up to 
log segment 20)
- TS A finishes downloading, and ends the remote bootstrap session
- TS B no longer has an anchor, so it GCs logs 1-16.
- TS A finishes opening the tablet it just copied, but is immediately unable to 
catch up (because it only has segments 1-4, while the leader, TS B, now only 
has 17-20)
- TS B evicts TS A

Each new attempt to add the replica hits the same sequence, so this loop will 
go on basically forever until the write workload on TS B stops.
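
The sequence above can be sketched as a toy model. This is only an 
illustration, not Kudu code: the names `LeaderLog` and `copy_attempt` are 
hypothetical, and retention is modeled as "keep the newest 4 segments" rather 
than the time-based retention the real server uses.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LeaderLog:
    """Toy model of the log segments held by the copy source (TS B)."""
    segments: List[int] = field(default_factory=lambda: [1, 2, 3, 4])
    retain: int = 4                 # assumption: retention counted in segments
    anchor: Optional[int] = None    # lowest segment pinned by a copy session

    def roll(self, count: int) -> None:
        # Writes arrive and the log rolls `count` new segments.
        last = self.segments[-1]
        self.segments.extend(range(last + 1, last + 1 + count))

    def gc(self) -> None:
        # Drop everything older than the retention window, unless anchored.
        cutoff = self.segments[-1] - self.retain + 1
        if self.anchor is not None:
            cutoff = min(cutoff, self.anchor)
        self.segments = [s for s in self.segments if s >= cutoff]


def copy_attempt(log: LeaderLog, segments_rolled_during_copy: int) -> bool:
    """Run one copy session; return True if the new replica can catch up."""
    copied = list(log.segments)     # TS A plans to download these segments
    log.anchor = copied[0]          # session start: anchor the logs
    log.roll(segments_rolled_during_copy)
    log.gc()                        # anchored, so nothing copied is dropped
    log.anchor = None               # session end: anchor is released
    log.gc()                        # old segments are now collected
    next_needed = copied[-1] + 1    # first segment TS A does not have
    # TS A can catch up only if there is no gap before the oldest kept segment.
    return next_needed >= log.segments[0]
```

Starting from segments 1-4, a call to `copy_attempt(log, 16)` returns `False` 
and leaves only segments 17-20, matching the list above. A second call with 
the same write rate fails the same way, while `copy_attempt(log, 0)` (no 
writes during the copy) returns `True`.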



