[
https://issues.apache.org/jira/browse/KUDU-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon updated KUDU-1735:
------------------------------
Assignee: Todd Lipcon
Code Review: http://gerrit.cloudera.org:8080/4916
Posted a WIP patch; I am testing it on a cluster that experienced this issue.
> CHECK failure when aborting an ignored config change operation
> --------------------------------------------------------------
>
> Key: KUDU-1735
> URL: https://issues.apache.org/jira/browse/KUDU-1735
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 1.0.1
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Critical
>
> The following sequence causes a CHECK failure:
> - a tablet server receives a CONFIG_CHANGE operation
> - the tablet server commits the operation (writing the new consensus config
> to disk), but crashes before it can write the associated COMMIT message to
> the log
> - the server is down for long enough that it is removed from the
> configuration again
> - when it comes back up, it sees the CONFIG_CHANGE again as a pending
> replicate. When the operation is added to PendingRounds, it is ignored
> because its configuration is already committed.
> - the tserver gets the request from the master to DeleteTablet because it's
> no longer part of the configuration
> -- when trying to abort the operation, it fires a CHECK "Aborting
> CHANGE_CONFIG_OP but there was no pending config set."
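
The sequence above can be sketched as a toy model. This is only an illustration of the reported state mismatch, not Kudu's code: the ToyReplica class, its fields, the op index 5, and the rule "ignore the op if its index is not newer than the committed config" are all assumptions made for the example. Only the CHECK message text is taken from the report.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class CheckFailure(Exception):
    """Stands in for a fatal CHECK failure in the real server."""


@dataclass
class ToyReplica:
    # Index of the config change already persisted to disk (hypothetical field).
    committed_config_index: int
    # Index of the config proposed by an in-flight change, if any.
    pending_config_index: Optional[int] = None
    # Pending rounds keyed by op index; the value is the op type.
    pending_rounds: Dict[int, str] = field(default_factory=dict)

    def add_pending_round(self, op_index: int, op_type: str) -> None:
        # The round is always tracked as pending...
        self.pending_rounds[op_index] = op_type
        if op_type == "CHANGE_CONFIG_OP":
            if op_index <= self.committed_config_index:
                # ...but the config is already committed on disk, so the op
                # is ignored and no pending config is set (assumed rule).
                return
            self.pending_config_index = op_index

    def abort_round(self, op_index: int) -> None:
        op_type = self.pending_rounds.pop(op_index)
        if op_type == "CHANGE_CONFIG_OP":
            # Abort assumes every pending config change set a pending config.
            if self.pending_config_index is None:
                raise CheckFailure(
                    "Aborting CHANGE_CONFIG_OP but there was no pending config set."
                )
            self.pending_config_index = None


def replay_after_crash() -> str:
    # The config from op 5 reached disk, but the server crashed before the
    # COMMIT message for op 5 was written to the log.
    replica = ToyReplica(committed_config_index=5)
    # On restart, op 5 is seen again as a pending replicate and is ignored.
    replica.add_pending_round(5, "CHANGE_CONFIG_OP")
    # DeleteTablet aborts the pending operation.
    try:
        replica.abort_round(5)
    except CheckFailure as e:
        return str(e)
    return "ok"


if __name__ == "__main__":
    print(replay_after_crash())
```

In this model the round is tracked as pending while the pending config is left unset, so the abort path hits an invariant that the add path never established. A config change that is genuinely new (index above the committed one) aborts cleanly.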