[ https://issues.apache.org/jira/browse/KUDU-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231493#comment-15231493 ]
Mike Percy commented on KUDU-1391:
----------------------------------
I'll submit a patch to rev the term when we see a higher term from a follower.
As you point out above, though, this won't really solve the issue.
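
As a rough illustration of what "rev the term" means here, the sketch below shows the standard Raft rule for a peer that sees a higher term in a response. This is not Kudu code: the `Peer` class, its fields, and `observe_term` are hypothetical names, and stepping down to follower is the textbook Raft behavior rather than a statement about what the patch will do.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    """Hypothetical, simplified view of a Raft peer's election state."""
    current_term: int
    role: str = "LEADER"  # one of "LEADER", "FOLLOWER", "CANDIDATE"

    def observe_term(self, responder_term: int) -> bool:
        """Advance our term if a response carries a higher one.

        Returns True if the term was advanced. Under the textbook Raft
        rule, a peer that learns of a higher term reverts to follower.
        """
        if responder_term > self.current_term:
            self.current_term = responder_term
            self.role = "FOLLOWER"
            return True
        return False


# A leader at term 5 receives a response from a follower at term 7.
leader = Peer(current_term=5)
advanced = leader.observe_term(7)
print(advanced, leader.current_term, leader.role)
```

Responses carrying an equal or lower term leave the peer's state unchanged.
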
I also think election pre-votes are generally helpful for cluster leader
stability. But again, as you pointed out later, the main issue here is the
partially committed config.
I believe that it would be legal for a non-leader to act as a bootstrap source.
Here is my thinking:
* A blank member of a config is clearly "down". Therefore that node is part of
a minority of failed nodes.
* Say a new node got bootstrapped from a config member that is out of date,
instead of from the leader. Then we restart all servers. The out-of-date server
and the new node must constitute a minority, and therefore neither can be
elected. Eventually an up-to-date node will be elected leader and these nodes
will be caught up.
So I think it is safe to bootstrap a new node from a candidate node, not just a
leader. It is possible to implement that and I think we should do it.
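
The minority argument above can be checked with a small model. This is only a sketch under stated assumptions, not Kudu's election logic: it uses a hypothetical 5-node config, treats "out of date" as missing a committed entry, and simplifies the vote rule to a comparison of last log indexes (real Raft compares the last log term first). The names `majority`, `electable`, and `logs` are made up for the example.

```python
from typing import Dict, List


def majority(config_size: int) -> int:
    """Smallest number of voters that forms a majority."""
    return config_size // 2 + 1


def electable(last_index: Dict[str, int]) -> List[str]:
    """Return the nodes that could win an election if every node is up.

    Simplification: a voter grants its vote when the candidate's last
    log index is >= its own. Real Raft compares the last log term first.
    """
    needed = majority(len(last_index))
    winners = []
    for candidate, cand_idx in last_index.items():
        votes = sum(1 for idx in last_index.values() if idx <= cand_idx)
        if votes >= needed:
            winners.append(candidate)
    return winners


# Hypothetical config: index 10 is committed, so a majority (A, B, C) holds it.
# D is out of date and E is the blank node.
logs = {"A": 10, "B": 10, "C": 10, "D": 8, "E": 0}

# Bootstrap E from the out-of-date member D instead of from the leader.
logs["E"] = logs["D"]

# D and E together hold only 2 of the 3 votes needed, so neither can win.
print(electable(logs))
```

Copying from a stale member does not grow the set of nodes that lack the committed entry, since the blank node was already in that set. That set therefore stays a minority, and only nodes holding the entry can collect enough votes.
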
> 2 of 3 replica alive but failed to elect leader
> -----------------------------------------------
>
> Key: KUDU-1391
> URL: https://issues.apache.org/jira/browse/KUDU-1391
> Project: Kudu
> Issue Type: Bug
> Reporter: Binglin Chang
> Attachments: 6a32cfa0353e4175809c2aa67e16ac9e.log.st172,
> 6a32cfa0353e4175809c2aa67e16ac9e.log.st212,
> 6a32cfa0353e4175809c2aa67e16ac9e.log.st212.before,
> 6a32cfa0353e4175809c2aa67e16ac9e.log.st216, remote-bootstrap-tool.patch
>
>
> Last weekend many TS had a lot of "too many open files" errors (haven't upgraded to
> , when using our internal deploy tool to restart the cluster (stop all TS, then
> start all TS), the control machine had an issue that seemed to block on
> writes to the ssh terminal (maybe a USB driver issue, not related to this bug), so
> only half (about 30) of the TS were shut down. After maybe 10 minutes, I
> switched to another control host and performed the whole restart.
> Then I saw that writes were blocked because 1 tablet was in a no-leader state.
> From the web UI, 2 of 3 replicas were in follower state and 1 was
> TABLET_DATA_TOMBSTONED, but all elections failed. I will attach the logs of the
> 2 followers.