[ https://issues.apache.org/jira/browse/KUDU-921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15301682#comment-15301682 ]
Todd Lipcon commented on KUDU-921:
----------------------------------
[~mpercy] are you working on this? I think this is more critical than we
originally thought -- the remote bootstrap client occupies a handler thread
of ConsensusService. As far as I remember, we don't limit the number of
concurrent remote bootstrap clients on a given server. So, in a small cluster
(e.g. 5 nodes) where there are 100+ tablets per server, a crash can generate
>10 new remote bootstraps on a single node. These then occupy all 10
ConsensusService handlers, meaning that all other tablets are unable to
heartbeat or replicate to this node for several minutes. In the worst case,
this can cause cascading failure -- because the other nodes can't heartbeat to
the node, they think it's dead and cause _more_ remote bootstraps to start.

Given that this could cascade, and that the fix is probably relatively simple
(just add another thread pool with no queueing so that we limit concurrency),
we should probably tackle this ASAP. If you're busy on other stuff, let me
know and I'll take it on.
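
To make the "thread pool with no queueing" idea concrete, here is a rough
standalone sketch. It does not use Kudu's actual ThreadPool API: the class
name BoundedNoQueueRunner and its TrySubmit() method are hypothetical and
exist only for illustration. The point is that a request beyond the
concurrency limit is rejected immediately rather than parked on a handler
thread or in a queue.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical helper (not Kudu's API): runs at most `max_concurrent` tasks
// at once on background threads and rejects any further task immediately
// instead of queueing it.
class BoundedNoQueueRunner {
 public:
  explicit BoundedNoQueueRunner(int max_concurrent)
      : max_concurrent_(max_concurrent), active_(0) {
    assert(max_concurrent > 0);
  }

  // Wait for every admitted task to finish before the object goes away.
  ~BoundedNoQueueRunner() {
    while (active_.load() > 0) std::this_thread::yield();
  }

  // Returns true if the task was started on a background thread, or false
  // if the concurrency limit is already reached. Never blocks the caller.
  bool TrySubmit(std::function<void()> task) {
    int cur = active_.load();
    do {
      if (cur >= max_concurrent_) return false;  // No free slot: reject.
    } while (!active_.compare_exchange_weak(cur, cur + 1));

    // Slot reserved. Run the task off the caller's thread and release the
    // slot as the last thing the thread does with this object.
    std::thread([this, task = std::move(task)]() {
      task();
      active_.fetch_sub(1);
    }).detach();
    return true;
  }

  // Number of tasks currently admitted and not yet finished.
  int active() const { return active_.load(); }

 private:
  const int max_concurrent_;
  std::atomic<int> active_;
};
```

With something along these lines, the RPC handler would only call TrySubmit()
and return; when it gets false it could answer with a "too busy" style error
so that the sender retries later. How that rejection should be reported back
to the caller is an assumption here and is not decided in this comment.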
> TSTabletManager: Run remote bootstrap on background thread
> ----------------------------------------------------------
>
> Key: KUDU-921
> URL: https://issues.apache.org/jira/browse/KUDU-921
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: Feature Complete
> Reporter: Mike Percy
> Assignee: Mike Percy
>
> StartRemoteBootstrap() should not run the whole remote bootstrap procedure on
> the caller's thread.