[
https://issues.apache.org/jira/browse/KUDU-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984037#comment-16984037
]
Adar Dembo commented on KUDU-3010:
----------------------------------
The quoted logs have quite a bit of info elided from them, not least of which
is the "PROCEEDING WITH UNSAFE CONFIG CHANGE ON THIS SERVER" line, which is
helpful in understanding exactly what was sent to the leader. Any chance you
can share a full log?
Looking at the implementation of unsafe config change, it seems our best
effort for the new term is "current term + 1". I suppose that could be
vulnerable to a race wherein the term increases twice while the config change
is processed, but I don't see anything in the logging to support that this is
what happened.
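To make the suspected race concrete, here is an illustrative sketch (plain Python, not Kudu's implementation; the helper name and the specific term numbers are invented, chosen to line up with the "Current: 14.94. Previous: 15.93" crash message):

```python
# Sketch of the race: the tool targets "observed term + 1" for the forced
# op, but if the replica's term advances twice before the op is appended,
# the forced op carries a term below the log's last op.

def forced_op_term(observed_term):
    # Best-effort term bump used when building the unsafe config-change op.
    return observed_term + 1

observed_term = 13      # term seen when the tool inspected the replica
last_log_op = (15, 93)  # (term, index): the term then advanced twice

forced_op = (forced_op_term(observed_term), 94)
print(forced_op)        # (14, 94)

# The invariant enforced in pending_rounds.cc: terms must be non-decreasing.
term_ok = forced_op[0] >= last_log_op[0]
print(term_ok)          # False -> the CHECK fires and the server aborts
```

Under this hypothetical double term bump, the forced op fails the monotonic-term check exactly as in the quoted fatal log line.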
Another theory: what happens if an unsafe config change request is sent to a
FOLLOWER that has a stale notion of the term (is that even possible)? Would the
subsequent self-Update crash the replica in the way you've indicated?
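For reference, the two guards visible in the quoted logs can be sketched roughly like this (illustrative Python, not Kudu's C++; all class and function names here are invented for the sketch):

```python
# A normal config change is refused while another is pending, whereas the
# unsafe path bypasses that guard and falls through to the term-monotonicity
# check in pending_rounds.cc, which in the real server is a fatal CHECK.

class PendingConfigError(Exception):
    pass

class CorruptionError(Exception):
    pass

class Replica:
    def __init__(self, last_op, pending_config=None):
        self.last_op = last_op               # (term, index) of last log op
        self.pending_config = pending_config

    def start_config_change(self, op, unsafe=False):
        if self.pending_config is not None and not unsafe:
            # "RaftConfig change currently pending. Only one is allowed
            # at a time."
            raise PendingConfigError(op)
        # The unsafe path merely warns ("Allowing unsafe config change even
        # though there is a pending config!") and proceeds.
        if op[0] < self.last_op[0]:
            # "New operation's term is not >= than the previous op's term."
            raise CorruptionError(op)
        self.pending_config = op
        self.last_op = op

r = Replica(last_op=(15, 93), pending_config=(15, 95))

# A normal config change is rejected cleanly:
try:
    r.start_config_change((15, 96))
except PendingConfigError:
    rejected = True

# An unsafe one carrying a stale term reaches the fatal check instead:
try:
    r.start_config_change((14, 94), unsafe=True)
except CorruptionError:
    crashed = True
```

The point of the sketch is that only the second guard stands between a stale-term unsafe request and the crash, regardless of which theory explains how the stale term arose.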
> unsafe_change_config can lead to a crash
> ----------------------------------------
>
> Key: KUDU-3010
> URL: https://issues.apache.org/jira/browse/KUDU-3010
> Project: Kudu
> Issue Type: Bug
> Components: consensus, ops-tooling
> Reporter: Andrew Wong
> Priority: Major
>
> I've seen a case of running the {{unsafe_change_config}} tool, per the steps
> laid out in the ["Bringing a tablet that has lost a majority of
> replicas"|https://kudu.apache.org/docs/administration.html#tablet_majority_down_recovery]
> steps, crashing a tserver with the following error:
> {code:java}
> I1028 08:24:31.241361 38436 raft_consensus.cc:684] T
> b90b0429806747a6b993d8543ab5fd50 P f344ade17ed94072b8839007ccc7570a [term 40
> FOLLOWER]: Illegal state: RaftConfig change currently pending. Only one is
> allowed at a time.
> W1028 08:24:31.241379 38436 raft_consensus.cc:1373] T
> b90b0429806747a6b993d8543ab5fd50 P f344ade17ed94072b8839007ccc7570a [term 40
> FOLLOWER]: Could not prepare transaction for op 34.48 and following 69 ops.
> Status for this op: Illegal state: RaftConfig change currently pending. Only
> one is allowed at a time.
> I1028 08:26:07.300520 38436 raft_consensus.cc:1058] T
> a6bfa86e43b74cfaa6feba4631879251 P f344ade17ed94072b8839007ccc7570a [term 17
> FOLLOWER]: Refusing update from remote peer f1a7fb14b7b44a5c8b31e93114d79a8d:
> Log matching property violated. Preceding OpId in replica: term: 15 index:
> 93. Preceding OpId from leader: term: 17 index: 112. (index mismatch)
> I1028 08:26:07.301476 38436 raft_consensus.cc:2819] T
> a6bfa86e43b74cfaa6feba4631879251 P f344ade17ed94072b8839007ccc7570a [term 17
> NON_PARTICIPANT]: Allowing unsafe config change even though there is a
> pending config! Existing pending config: opid_index: 95 OBSOLETE_local: false
> peers { permanent_uuid: "7875cc5598a44bd893998cba7bd2cc47" member_type: VOTER
> last_known_addr { host: "foo01.server.net" port: 7050 }
> attrs { promote: false }
> } peers { permanent_uuid: "f1a7fb14b7b44a5c8b31e93114d79a8d" member_type:
> NON_VOTER last_known_addr { host: "foo04.server.net" port: 7050 }
> attrs { promote: true }
> } unsafe_config_change: true; New pending config: opid_index: 96
> OBSOLETE_local: false peers { permanent_uuid:
> "7875cc5598a44bd893998cba7bd2cc47" member_type: VOTER last_known_addr { host:
> "foo01.server.net" port: 7050 }
> attrs { promote: false }
> } peers { permanent_uuid: "f1a7fb14b7b44a5c8b31e93114d79a8d" member_type:
> NON_VOTER last_known_addr { host: "foo04.server.net" port: 7050 }
> attrs { promote: true }
> } peers { permanent_uuid: "231e6fdad22647978c9a76c07407da4c" member_type:
> NON_VOTER last_known_addr { host: "foo02.server.net" port: 7050 }
> attrs { promote: true }
> } unsafe_config_change: true
> F1028 08:26:07.302338 38436 pending_rounds.cc:179] Check failed: _s.ok() Bad
> status: Corruption: New operation's term is not >= than the previous op's
> term. Current: 14.94. Previous: 15.93
> {code}
> It seems like the tool is permitting the persistence of a bad op, considering
> there's already a config change in flight.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)