[ https://issues.apache.org/jira/browse/KUDU-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jean-Daniel Cryans updated KUDU-1278:
-------------------------------------
Target Version/s: 0.7.0 (was: 0.8.0)
> Tablets that take >5 minutes to copy will never remote bootstrap
> ----------------------------------------------------------------
>
> Key: KUDU-1278
> URL: https://issues.apache.org/jira/browse/KUDU-1278
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 0.6.0
> Reporter: Todd Lipcon
> Assignee: Binglin Chang
> Priority: Blocker
> Fix For: 0.7.0
>
>
> [~decster] and I debugged this issue on his cluster. One of the servers had
> been shut down due to bad RAM, which triggered remote bootstrap of all of its
> tablets to create new replicas.
> During remote bootstrap, the leader replica continues to try to replicate
> operations to the new follower while the follower is still bootstrapping.
> Each attempt causes the leader to try to trigger remote bootstrap again, which
> fails with a "Remote bootstrap already in progress" error. The leader counts
> this as an unsuccessful communication with the follower. After 5 minutes of
> receiving this error, it decides that the follower is dead, evicts it, and
> requests another new replica. When the previous replica finishes copying, it
> finds out that it has been evicted and deletes everything it just copied. This
> cycle repeats forever.
> We need to fix the leader so that, as long as the remote bootstrapping
> replica is making progress, we don't consider it dead.
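To make the failure mode and the proposed direction concrete, here is a minimal simulation sketch. It is not Kudu code: the class `FollowerHealth`, the function `simulate`, the `progress_counts` switch, and the use of a bytes-copied counter as the progress signal are all hypothetical names and assumptions chosen for illustration. Only the 5-minute eviction window and the "already in progress" error come from the description above.

```python
EVICTION_TIMEOUT_S = 300.0  # the 5 minutes mentioned in the description


class FollowerHealth:
    """Hypothetical leader-side view of one follower (not Kudu's real class)."""

    def __init__(self, now_s: float, progress_counts: bool):
        self.last_alive_s = now_s
        # Assumption: when True, bootstrap progress counts as a sign of life.
        self.progress_counts = progress_counts
        self.bytes_copied = 0

    def on_response(self, now_s: float, ok: bool, bytes_copied: int = 0) -> None:
        # Progress means the follower has copied more than we last saw.
        made_progress = bytes_copied > self.bytes_copied
        self.bytes_copied = max(self.bytes_copied, bytes_copied)
        if ok or (self.progress_counts and made_progress):
            self.last_alive_s = now_s

    def should_evict(self, now_s: float) -> bool:
        return now_s - self.last_alive_s > EVICTION_TIMEOUT_S


def simulate(copy_duration_s: int, progress_counts: bool) -> bool:
    """Return True if the follower is evicted before its copy finishes."""
    follower = FollowerHealth(0.0, progress_counts)
    for t in range(1, copy_duration_s + 1):
        # Once per simulated second the leader retries and receives the
        # "Remote bootstrap already in progress" error (ok=False), along
        # with an assumed report of how much has been copied so far.
        follower.on_response(float(t), ok=False, bytes_copied=t)
        if follower.should_evict(float(t)):
            return True
    return False
```

Under these assumptions, `simulate(600, progress_counts=False)` reports an eviction partway through a 10-minute copy, which matches the cycle described above, while `simulate(600, progress_counts=True)` does not. A copy shorter than the window, such as `simulate(200, progress_counts=False)`, completes either way.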
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)