[ 
https://issues.apache.org/jira/browse/KUDU-921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15301682#comment-15301682
 ] 

Todd Lipcon commented on KUDU-921:
----------------------------------

[~mpercy] are you working on this? I think this is more critical than we 
originally thought -- the remote bootstrap client occupies a handler thread 
of ConsensusService while it runs. As far as I remember, we don't limit the 
number of concurrent remote bootstrap clients on a given server. So, in a 
small cluster (e.g. 5 nodes) where there are 100+ tablets per server, a crash 
can generate >10 new remote bootstraps on a single node. These then occupy all 
10 ConsensusService handlers, meaning that all other tablets are unable to 
heartbeat or replicate to this node for several minutes. In the worst case, 
this can cause cascading failure -- because the other nodes can't heartbeat to 
the node, they think it's dead, and start _more_ remote bootstraps.

Given that it could result in a cascading failure, and that the fix is probably 
relatively simple (just add another thread pool with no queueing so that we 
limit concurrency), we should probably tackle this ASAP. If you're busy with 
other stuff, let me know and I'll take it on.
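
To make the "thread pool with no queueing" idea concrete, here is a minimal 
sketch in standard C++. It is not Kudu's actual ThreadPool API: the class name 
NoQueuePool and the methods TrySubmit() and Wait() are hypothetical, invented 
for illustration. The point is only that a submission beyond the concurrency 
limit is rejected immediately instead of being queued, so the caller (here, 
the RPC handler thread) is never tied up.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper (not part of Kudu): runs at most 'max_concurrent'
// tasks at once on background threads and keeps no queue of pending work.
class NoQueuePool {
 public:
  explicit NoQueuePool(int max_concurrent) : max_(max_concurrent) {
    assert(max_concurrent > 0);
  }

  // Block until all running tasks finish so they never outlive the pool.
  ~NoQueuePool() { Wait(); }

  // Starts 'task' on a background thread and returns true, or returns
  // false right away if 'max_concurrent' tasks are already running.
  bool TrySubmit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> l(mu_);
      if (active_ >= max_) return false;  // No queueing: reject instead.
      ++active_;
    }
    std::thread([this, t = std::move(task)]() {
      t();
      std::lock_guard<std::mutex> l(mu_);
      --active_;
      cv_.notify_all();
    }).detach();
    return true;
  }

  // Waits until no task is running.
  void Wait() {
    std::unique_lock<std::mutex> l(mu_);
    cv_.wait(l, [this] { return active_ == 0; });
  }

 private:
  const int max_;
  int active_ = 0;  // Number of tasks currently running; guarded by mu_.
  std::mutex mu_;
  std::condition_variable cv_;
};
```

Under this assumption, a handler for StartRemoteBootstrap() would call 
something like TrySubmit() and, on false, return a "too busy" style error to 
the requester rather than blocking, leaving the ConsensusService handlers free 
for heartbeats and replication.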

> TSTabletManager: Run remote bootstrap on background thread
> ----------------------------------------------------------
>
>                 Key: KUDU-921
>                 URL: https://issues.apache.org/jira/browse/KUDU-921
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: Feature Complete
>            Reporter: Mike Percy
>            Assignee: Mike Percy
>
> StartRemoteBootstrap() should not run the whole remote bootstrap procedure on 
> the caller's thread.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
