[ https://issues.apache.org/jira/browse/KUDU-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284015#comment-15284015 ]

Todd Lipcon commented on KUDU-1194:
-----------------------------------
[~mpercy] - any thoughts on this one? I couldn't quite tell from reading the
description how it is that a server could get into this state. It seems like
it's at least an uncommon scenario, so maybe Critical priority is too high?
What do you think?

> consensus: Allow abort of uncommittable config change ops
> ---------------------------------------------------------
>
> Key: KUDU-1194
> URL: https://issues.apache.org/jira/browse/KUDU-1194
> Project: Kudu
> Issue Type: Improvement
> Reporter: Mike Percy
> Assignee: Mike Percy
> Priority: Critical
>
> Wanted to capture a few thoughts about manually fixing broken configs or
> automatically rolling back bad config changes. This isn't a fully baked
> design, just wanted to jot down some initial thoughts.
>
> A general way to (attempt to) abort uncommitted ops is to truncate the Raft
> log on the leader (and replace the op with a NO_OP or something similar).
>
> Some thoughts on recovering from "bad" configs:
> * We may hit a situation where there is an in-progress config change
> operation that will be impossible to commit due to a majority of the nodes in
> the "target" config being permanently dead. If the leader is still alive, we
> can provide a timeout on these ops or a way to explicitly (via RPC) abort
> them by truncating the log.
> * If no leader is alive, and it's impossible to elect one, then we could
> write an "unsafe" tool only for emergency use that could do something evil
> like make the follower think that the tool is the new leader and append an
> unsafe change-config op to the follower's log.
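
To make the "truncate the log and replace the op with a NO_OP" idea from the description concrete, here is a minimal sketch on a toy in-memory log. It is not Kudu's consensus code: ToyRaftLog, LogEntry, and abort_uncommitted_config_change are hypothetical names made up for illustration, and the sketch assumes the leader's log is a simple list with a known commit index. It ignores replication, durability, and whether followers have already received the aborted op.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical op type names, chosen for this sketch only.
NO_OP = "NO_OP"
CHANGE_CONFIG = "CHANGE_CONFIG"
WRITE = "WRITE"


@dataclass
class LogEntry:
    index: int      # 1-based position in the log
    term: int       # Raft term in which the entry was appended
    op_type: str
    payload: str = ""


@dataclass
class ToyRaftLog:
    """Toy leader-side log (an assumption, not Kudu's actual log class)."""
    entries: List[LogEntry] = field(default_factory=list)
    commit_index: int = 0  # highest 1-based index known to be committed

    def append(self, term: int, op_type: str, payload: str = "") -> LogEntry:
        entry = LogEntry(len(self.entries) + 1, term, op_type, payload)
        self.entries.append(entry)
        return entry

    def abort_uncommitted_config_change(self, current_term: int) -> List[LogEntry]:
        """Truncate the log at the first uncommitted CHANGE_CONFIG op and
        append a NO_OP in its place. Committed entries are never touched.
        Returns the entries that were dropped (empty if nothing to abort)."""
        # entries[pos] has index pos + 1, so uncommitted entries start at
        # list position commit_index.
        for pos in range(self.commit_index, len(self.entries)):
            if self.entries[pos].op_type == CHANGE_CONFIG:
                aborted = self.entries[pos:]
                del self.entries[pos:]
                self.append(current_term, NO_OP)
                return aborted
        return []


if __name__ == "__main__":
    log = ToyRaftLog()
    log.append(1, WRITE, "a")
    log.append(1, WRITE, "b")
    log.commit_index = 2
    log.append(1, CHANGE_CONFIG, "target config with dead majority")
    dropped = log.abort_uncommitted_config_change(current_term=1)
    print([e.op_type for e in dropped], [e.op_type for e in log.entries])
```

One consequence worth noting in this sketch: truncation drops every uncommitted entry after the config change as well, so any ops queued behind it would have to be resubmitted. A timeout-based variant, as suggested in the first bullet, could call the same method once the pending config change has been outstanding for too long.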