[ https://issues.apache.org/jira/browse/KUDU-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185230#comment-15185230 ]

Todd Lipcon commented on KUDU-1338:
-----------------------------------

Looking at the code, I think the above might be the issue:

TryRemoveFollowerTask does:
{code}
  WARN_NOT_OK(ChangeConfig(req, Bind(&DoNothingStatusCB), &error_code),
              state_->LogPrefixThreadSafe() + "Unable to remove follower " + uuid);
{code}
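For reference, DoNothingStatusCB is (as far as I can tell) just a no-op status callback, so whatever status eventually reaches it gets dropped on the floor. Roughly:
{code}
// Sketch of the no-op callback bound above: it ignores the status entirely.
static void DoNothingStatusCB(const Status& status) {}
{code}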
So 'client_cb' here is DoNothingStatusCB. ChangeConfig then does:
{code}
  RETURN_NOT_OK(ReplicateConfigChangeUnlocked(committed_config, new_config,
                                              Bind(&RaftConsensus::MarkDirtyOnSuccess,
                                                   Unretained(this),
                                                   string("Config change replication complete"),
                                                   client_cb)));
{code}
i.e. 'client_cb' is now wrapped in a callback that still doesn't handle failure.
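Judging from the Bind() above, MarkDirtyOnSuccess takes the bound reason and client_cb plus the status supplied at run time; presumably it marks the consensus state dirty when the status is OK and otherwise just forwards the status to the original (no-op) callback. Something like:
{code}
// Rough sketch of the wrapper, inferred from the Bind() arguments above:
// mark dirty on success, then hand the status to 'client_cb' (here
// DoNothingStatusCB), which ignores it.
void RaftConsensus::MarkDirtyOnSuccess(const string& reason,
                                       const StatusCallback& client_cb,
                                       const Status& status) {
  if (status.ok()) {
    MarkDirty(reason);  // assumed helper that fires the dirty callback
  }
  client_cb.Run(status);
}
{code}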
ReplicateConfigChangeUnlocked does:
{code}
  round->SetConsensusReplicatedCallback(Bind(&RaftConsensus::NonTxRoundReplicationFinished,
                                             Unretained(this),
                                             Unretained(round.get()),
                                             client_cb));
{code}

NonTxRoundReplicationFinished does:
{code}
  if (!status.ok()) {
    // TODO: Do something with the status on failure?
    LOG(INFO) << state_->LogPrefixThreadSafe() << op_type_str << " replication failed: "
              << status.ToString();
    client_cb.Run(status);
    return;
  }
{code}

where that TODO looks awfully relevant. If a config change gets aborted, we 
probably need to go back to using the old config, right?
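Very roughly, I'd expect the failure branch of NonTxRoundReplicationFinished to need something along these lines (ClearPendingConfigUnlocked is a hypothetical helper here, just to illustrate the intent of reverting to the committed config):
{code}
  if (!status.ok()) {
    LOG(INFO) << state_->LogPrefixThreadSafe() << op_type_str << " replication failed: "
              << status.ToString();
    // Sketch: if an aborted round was a config change, drop the pending config
    // so the replica falls back to the last committed config instead of staying
    // wedged with "RaftConfig change currently pending".
    if (round->replicate_msg()->op_type() == CHANGE_CONFIG_OP) {
      state_->ClearPendingConfigUnlocked();  // hypothetical helper
    }
    client_cb.Run(status);
    return;
  }
{code}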

> Tablet stuck in RaftConfig change currently pending
> ---------------------------------------------------
>
>                 Key: KUDU-1338
>                 URL: https://issues.apache.org/jira/browse/KUDU-1338
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 0.7.0
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>         Attachments: KUDU_TSERVER.node-2.internal.gz, 
> KUDU_TSERVER.node-3.internal.gz, KUDU_TSERVER.node-5.internal.gz, logs.tgz
>
>
> We've been adapting the consensus logs for a while and I think we can finally 
> get to the bottom of this issue. I'm attaching the logs from the 3 nodes that 
> participated in the same config for tablet eaa1877a2b3540cf8202aff844c6ca79.
> ITBLL is driving the load and eventually fails at 2016-02-15 14:53:12,005 
> trying to write to node-2 AKA a1081edd2ca24f6b9dcdd7e5000f95ec. The peer that 
> gets stuck is node-5 AKA cdec7fdacbac4ad1b095275b3bdbbe5c, starting from this 
> line:
> {noformat}
> I0215 14:28:41.585695  2020 raft_consensus_state.cc:459] T 
> eaa1877a2b3540cf8202aff844c6ca79 P cdec7fdacbac4ad1b095275b3bdbbe5c [term 69 
> FOLLOWER]: Illegal state: RaftConfig change currently pending. Only one is 
> allowed at a time.
> {noformat}
> The chaos monkey running on this setup is dropping packets one node at a time.
> I'll attach the logs in a moment.


