[ https://issues.apache.org/jira/browse/KUDU-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adar Dembo updated KUDU-3010:
-----------------------------
    Description: 
I've seen a case where running the {{unsafe_change_config}} tool, per the steps 
laid out in ["Bringing a tablet that has lost a majority of 
replicas"|https://kudu.apache.org/docs/administration.html#tablet_majority_down_recovery], 
crashed a tserver with the following error:

{code:java}
I1028 08:24:31.241361 38436 raft_consensus.cc:684] T 
b90b0429806747a6b993d8543ab5fd50 P f344ade17ed94072b8839007ccc7570a [term 40 
FOLLOWER]: Illegal state: RaftConfig change currently pending. Only one is 
allowed at a time.
W1028 08:24:31.241379 38436 raft_consensus.cc:1373] T 
b90b0429806747a6b993d8543ab5fd50 P f344ade17ed94072b8839007ccc7570a [term 40 
FOLLOWER]: Could not prepare transaction for op 34.48 and following 69 ops. 
Status for this op: Illegal state: RaftConfig change currently pending. Only 
one is allowed at a time.
I1028 08:26:07.300520 38436 raft_consensus.cc:1058] T 
a6bfa86e43b74cfaa6feba4631879251 P f344ade17ed94072b8839007ccc7570a [term 17 
FOLLOWER]: Refusing update from remote peer f1a7fb14b7b44a5c8b31e93114d79a8d: 
Log matching property violated. Preceding OpId in replica: term: 15 index: 93. 
Preceding OpId from leader: term: 17 index: 112. (index mismatch)
I1028 08:26:07.301476 38436 raft_consensus.cc:2819] T 
a6bfa86e43b74cfaa6feba4631879251 P f344ade17ed94072b8839007ccc7570a [term 17 
NON_PARTICIPANT]: Allowing unsafe config change even though there is a pending 
config! Existing pending config: opid_index: 95 OBSOLETE_local: false peers { 
permanent_uuid: "7875cc5598a44bd893998cba7bd2cc47" member_type: VOTER 
last_known_addr{ host: "foo01.server.net" port: 7050 }
attrs{ promote: false }
} peers { permanent_uuid: "f1a7fb14b7b44a5c8b31e93114d79a8d" member_type: 
NON_VOTER last_known_addr{ host: "foo04.server.net" port: 7050 }
attrs{ promote: true }
} unsafe_config_change: true; New pending config: opid_index: 96 
OBSOLETE_local: false peers { permanent_uuid: 
"7875cc5598a44bd893998cba7bd2cc47" member_type: VOTER last_known_addr{ host: 
"foo01.server.net" port: 7050 }
attrs{ promote: false }
} peers { permanent_uuid: "f1a7fb14b7b44a5c8b31e93114d79a8d" member_type: 
NON_VOTER last_known_addr{ host: "foo04.server.net" port: 7050 }
attrs{ promote: true }
} peers { permanent_uuid: "231e6fdad22647978c9a76c07407da4c" member_type: 
NON_VOTER last_known_addr{ host: "foo02.server.net" port: 7050 }
attrs{ promote: true }
} unsafe_config_change: true
F1028 08:26:07.302338 38436 pending_rounds.cc:179] Check failed: _s.ok() Bad 
status: Corruption: New operation's term is not >= than the previous op's term. 
Current: 14.94. Previous: 15.93
{code}

It seems the tool permits a bad op to be persisted even though a config change 
is already in flight.
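
To make the failed check concrete, here is a minimal Python sketch of the invariant named in the fatal message ("New operation's term is not >= than the previous op's term"). It is an illustration only, not Kudu's implementation: {{OpId}}, {{parse_op_id}}, and {{term_is_monotonic}} are hypothetical names, and reading "14.94" as term 14, index 94 is inferred from the "term: 15 index: 93" line earlier in the same log.

```python
from typing import NamedTuple


class OpId(NamedTuple):
    """Hypothetical stand-in for a Raft op identifier (term, index)."""
    term: int
    index: int


def parse_op_id(text: str) -> OpId:
    """Parse the 'term.index' notation seen in the log, e.g. '15.93'."""
    term, index = text.split(".")
    return OpId(int(term), int(index))


def term_is_monotonic(previous: OpId, current: OpId) -> bool:
    """Return True when the new op's term is >= the previous op's term."""
    return current.term >= previous.term


# Values copied from the fatal line: "Current: 14.94. Previous: 15.93".
previous = parse_op_id("15.93")
current = parse_op_id("14.94")

# The index advances (93 -> 94) but the term goes backwards (15 -> 14),
# so the invariant from the fatal message does not hold for this pair.
print(term_is_monotonic(previous, current))
```

With the values from the log, the check reports that the invariant is violated, which matches the "Corruption" status that tripped the check at pending_rounds.cc:179.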

  was:
I've seen a case of running the {{unsafe_change_config}} tool, per the steps 
laid out in the ["Bringing a tablet that has lost a majority of 
replicas"|https://kudu.apache.org/docs/administration.html#tablet_majority_down_recovery]
 steps, crashing a tserver with the following error:

{code:java}
I1028 08:24:31.241361 38436 raft_consensus.cc:684] T 
b90b0429806747a6b993d8543ab5fd50 P f344ade17ed94072b8839007ccc7570a [term 40 
FOLLOWER]: Illegal state: RaftConfig change currently pending. Only one is 
allowed at a time.
W1028 08:24:31.241379 38436 raft_consensus.cc:1373] T 
b90b0429806747a6b993d8543ab5fd50 P f344ade17ed94072b8839007ccc7570a [term 40 
FOLLOWER]: Could not prepare transaction for op 34.48 and following 69 ops. 
Status for this op: Illegal state: RaftConfig change currently pending. Only 
one is allowed at a time.I1028 08:26:07.300520 38436 raft_consensus.cc:1058] T 
a6bfa86e43b74cfaa6feba4631879251 P f344ade17ed94072b8839007ccc7570a [term 17 
FOLLOWER]: Refusing update from remote peer f1a7fb14b7b44a5c8b31e93114d79a8d: 
Log matching property violated. Preceding OpId in replica: term: 15 index: 93. 
Preceding OpId from leader: term: 17 index: 112. (index mismatch)I1028 
08:26:07.301476 38436 raft_consensus.cc:2819] T 
a6bfa86e43b74cfaa6feba4631879251 P f344ade17ed94072b8839007ccc7570a [term 17 
NON_PARTICIPANT]: Allowing unsafe config change even though there is a pending 
config! Existing pending config: opid_index: 95 OBSOLETE_local: false peers { 
permanent_uuid: "7875cc5598a44bd893998cba7bd2cc47" member_type: VOTER 
last_known_addr{ host: "foo01.server.net" port: 7050 }
attrs{ promote: false }
} peers { permanent_uuid: "f1a7fb14b7b44a5c8b31e93114d79a8d" member_type: 
NON_VOTER last_known_addr{ host: "foo04.server.net" port: 7050 }
attrs{ promote: true }
} unsafe_config_change: true; New pending config: opid_index: 96 
OBSOLETE_local: false peers { permanent_uuid: 
"7875cc5598a44bd893998cba7bd2cc47" member_type: VOTER last_known_addr{ host: 
"foo01.server.net" port: 7050 }
attrs{ promote: false }
} peers { permanent_uuid: "f1a7fb14b7b44a5c8b31e93114d79a8d" member_type: 
NON_VOTER last_known_addr{ host: "foo04.server.net" port: 7050 }
attrs{ promote: true }
} peers { permanent_uuid: "231e6fdad22647978c9a76c07407da4c" member_type: 
NON_VOTER last_known_addr{ host: "foo02.server.net" port: 7050 }
attrs{ promote: true }
} unsafe_config_change: trueF1028 08:26:07.302338 38436 pending_rounds.cc:179] 
Check failed: _s.ok() Bad status: Corruption: New operation's term is not >= 
than the previous op's term. Current: 14.94. Previous: 15.93
{code}

It seems like the tool is permitting the persistence of a bad op, considering 
there's already a config change in flight.


> unsafe_change_config can lead to a crash
> ----------------------------------------
>
>                 Key: KUDU-3010
>                 URL: https://issues.apache.org/jira/browse/KUDU-3010
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus, ops-tooling
>            Reporter: Andrew Wong
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
