[jira] [Commented] (KUDU-1408) Adding a replica may never succeed if copying tablet takes longer than the log retention time

Dinesh Bhat (JIRA) Wed, 17 Aug 2016 18:20:06 -0700

    [ https://issues.apache.org/jira/browse/KUDU-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425753#comment-15425753 ]


Dinesh Bhat commented on KUDU-1408:
-----------------------------------

Hi [~tlipcon], a few naive questions from my end:
    - Does 'anchoring a log' mean checkpointing in the WAL context? I tried to 
grep for this term in our docs before going to the code, but had no success.
    - Also, is this workflow any different from rejoining a stale replica to 
the cluster? (where stale means going as far back as starting a new replica?)
    - Also, instead of sending log anchors via heartbeats (as you seem to be 
suggesting above), could we not rely on the follower to tell the leader 
where he wants to replay the logs from? In other words, the remote 
bootstrap begins with the leader asking the follower what his (or her :)) last 
checkpoint was. 
I could be missing several dots here though.
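
To make the last question concrete, here is a rough sketch of one possible 
reading of it: the leader keeps every log segment that some follower still 
says it needs. The function name gc_cutoff and the follower_positions map are 
hypothetical and are not taken from the Kudu code base.

```python
def gc_cutoff(retention_cutoff, follower_positions):
    """Return the lowest log segment the leader must keep.

    retention_cutoff: first segment inside the normal retention window.
    follower_positions: hypothetical map of follower id -> the next
        segment that follower says it needs to replay from.
    """
    if not follower_positions:
        # Nobody is catching up, so the retention policy alone decides.
        return retention_cutoff
    # Never GC past the oldest segment any follower still needs.
    return min(retention_cutoff, min(follower_positions.values()))
```

With a retention cutoff of 17 and a follower that still needs segment 5, this 
returns 5, so segments 5 and later would be kept.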

> Adding a replica may never succeed if copying tablet takes longer than the 
> log retention time
> ---------------------------------------------------------------------------------------------
>
>                 Key: KUDU-1408
>                 URL: https://issues.apache.org/jira/browse/KUDU-1408
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus, tserver
>    Affects Versions: 0.8.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>
> Currently, while a remote bootstrap session is in progress, we anchor the 
> logs from the time at which it started. However, as soon as the session 
> finishes, we drop the anchor and delete any logs. If the tablet copy itself 
> takes longer than the log retention period, a scenario like the following 
> is likely:
> - TS A starts downloading from TS B. It plans to download segments 1-4 and 
> adds an anchor.
> - TS B handles writes for 20 minutes, rolling the log many times (e.g. up to 
> log segment 20)
> - TS A finishes downloading, and ends the remote bootstrap session
> - TS B no longer has an anchor, so GCs all logs 1-16.
> - TS A finishes opening the tablet it just copied, but immediately is unable 
> to catch up (because it only has segments 1-4, but the leader only has 17-20)
> - TS B evicts TS A
> This loop will go on basically forever until the write workload stops on TS B.
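
The scenario in the description can be replayed with a small toy model. This 
is only an illustrative sketch: the Leader class, the retain window of 4 
segments, and the simulate function are assumptions made for the example and 
do not correspond to Kudu's actual log or anchor classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Leader:
    """Toy model of TS B: a list of log segments plus an optional anchor."""
    segments: List[int] = field(default_factory=lambda: [1, 2, 3, 4])
    anchor: Optional[int] = None  # lowest segment that must be retained
    retain: int = 4               # assumed: keep only the newest 4 segments

    def roll_to(self, last: int) -> None:
        # Writes keep arriving, so the log rolls up to segment `last`.
        self.segments.extend(range(self.segments[-1] + 1, last + 1))

    def gc(self) -> None:
        # Drop segments older than the retention window, unless anchored.
        cutoff = self.segments[-1] - self.retain + 1
        if self.anchor is not None:
            cutoff = min(cutoff, self.anchor)
        self.segments = [s for s in self.segments if s >= cutoff]


def simulate():
    b = Leader()
    b.anchor = 1                  # TS A starts the copy and anchors segment 1
    copied = list(b.segments)     # TS A plans to download segments 1-4
    b.roll_to(20)                 # TS B handles writes during the copy
    b.gc()                        # the anchor holds, nothing is deleted
    during = list(b.segments)
    b.anchor = None               # session ends, the anchor is dropped
    b.gc()                        # TS B now GCs segments 1-16
    next_needed = copied[-1] + 1  # TS A needs segment 5 to catch up
    return copied, during, b.segments, next_needed in b.segments
```

Calling simulate() shows TS A holding segments 1-4 while TS B is left with 
only 17-20, so the last returned value (whether TS A can catch up) is False.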



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
