[ 
https://issues.apache.org/jira/browse/KUDU-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613069#comment-15613069
 ] 

Todd Lipcon commented on KUDU-1735:
-----------------------------------

Brainstorming a couple fixes here:
- perhaps we should wait until all COMMIT messages are flushed before we flush 
the new cmeta to disk when a config change is committed?
- the 'Abort a failed config change' code path also looks incorrect: it 
doesn't verify that the pending config change being cleared is the one that 
failed
- maybe the CHECK is invalid if we see that we're aborting a config change that 
is older than the current committed one?
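
To make the second and third ideas concrete, here is a minimal Python sketch of a guarded abort path. It is a toy model, not Kudu's code: ToyConsensusState, abort_config_change and the opid-index fields are all hypothetical names, and treating an operation at or below the committed index as stale (rather than strictly older) is an assumption on my part.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConsensusState:
    """Toy stand-in for a replica's config state (hypothetical, not a Kudu class)."""
    committed_opid_index: int                 # index of the committed config
    pending_opid_index: Optional[int] = None  # index of the pending config, if any


def abort_config_change(state: ToyConsensusState, failed_opid_index: int) -> str:
    """Abort a failed config change and report which branch was taken."""
    # Third idea: an op at or below the committed config was already applied,
    # so there is nothing pending to clear and the CHECK should not fire.
    # (The "<=" is an assumption; the comment only says "older".)
    if failed_opid_index <= state.committed_opid_index:
        return "ignored-stale"
    # The existing invariant: a newer op being aborted must have a pending config.
    if state.pending_opid_index is None:
        raise AssertionError(
            "Aborting CHANGE_CONFIG_OP but there was no pending config set.")
    # Second idea: only clear the pending config if it belongs to the failed op.
    if state.pending_opid_index != failed_opid_index:
        return "ignored-mismatch"
    state.pending_opid_index = None
    return "cleared"
```

With these guards, aborting the already-committed operation from the report takes the "ignored-stale" branch instead of crashing, while a genuinely pending change is still cleared.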

Would be good to figure out why our various consensus stress tests don't 
trigger this behavior and add one that does (in addition to a more specifically 
targeted test case).

[~mpercy] do you have time to take a look at this as the original author of 
this part of the code?

> CHECK failure when aborting an ignored config change operation
> --------------------------------------------------------------
>
>                 Key: KUDU-1735
>                 URL: https://issues.apache.org/jira/browse/KUDU-1735
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.0.1
>            Reporter: Todd Lipcon
>            Priority: Critical
>
> The following sequence causes a CHECK failure:
> - a tablet server receives a CONFIG_CHANGE operation
> - the tablet server commits the operation (writing the new consensus config 
> to disk), but crashes before it can write the associated COMMIT message to 
> the log
> - the server stays down long enough that it is removed from the 
> configuration again
> - when it comes back up, it sees the CONFIG_CHANGE again as a pending 
> replicate. When it's added to PendingRounds, it is ignored as we can see that 
> this configuration is already committed.
> - the tserver gets the request from the master to DeleteTablet because it's 
> no longer part of the configuration
> -- when trying to abort the operation, it fires a CHECK "Aborting 
> CHANGE_CONFIG_OP but there was no pending config set."
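
The sequence in the description can be replayed against a small Python toy model. This is only a sketch of the ordering problem under simplifying assumptions: ToyReplica, its methods, and the log record format are hypothetical and do not correspond to Kudu's actual classes or on-disk formats.

```python
class ToyReplica:
    """Toy replica: durable cmeta and log, in-memory pending config (hypothetical)."""

    def __init__(self):
        self.cmeta_index = 0        # durable: opid index of the committed config
        self.wal = []               # durable: ("REPLICATE", idx) / ("COMMIT", idx)
        self.pending_config = None  # in-memory only
        self.in_flight = []         # in-memory only: ops awaiting commit or abort

    def receive_config_change(self, idx):
        self.wal.append(("REPLICATE", idx))
        self.pending_config = idx

    def commit_config_change(self, idx, crash_before_commit_msg=False):
        self.cmeta_index = idx      # new config is written to disk first
        self.pending_config = None
        if crash_before_commit_msg:
            return                  # crash: the COMMIT record never reaches the log
        self.wal.append(("COMMIT", idx))

    def restart(self):
        self.pending_config = None
        committed = {i for kind, i in self.wal if kind == "COMMIT"}
        self.in_flight = [i for kind, i in self.wal
                          if kind == "REPLICATE" and i not in committed]
        for idx in self.in_flight:
            # A replicate whose config is already in cmeta is ignored, so no
            # pending config is set for it.
            if idx > self.cmeta_index:
                self.pending_config = idx

    def delete_tablet(self):
        # Deleting the tablet aborts every in-flight operation.
        for idx in self.in_flight:
            if self.pending_config is None:
                raise AssertionError(
                    "Aborting CHANGE_CONFIG_OP but there was no pending config set.")
            self.pending_config = None
        self.in_flight = []


def replay(crash_before_commit_msg: bool) -> str:
    r = ToyReplica()
    r.receive_config_change(5)
    r.commit_config_change(5, crash_before_commit_msg)
    r.restart()  # while down, the server was removed from the config
    try:
        r.delete_tablet()  # the master asks for the tablet to be deleted
    except AssertionError as e:
        return f"CHECK failed: {e}"
    return "deleted"
```

In this model, replay(True) hits the check because the replayed operation is still in flight but never set a pending config, while replay(False) deletes the tablet cleanly.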



