[ 
https://issues.apache.org/jira/browse/KAFKA-19850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Chen resolved KAFKA-19850.
-------------------------------
    Fix Version/s:     (was: 4.2.0)
       Resolution: Not A Problem

Here is a summary of what we've done for this issue. In short, we need a KIP to 
fully resolve it. For now, a manual join should be good enough to work around 
the issue when it occurs.

*The following solutions were proposed to fix this problem:*
1. A node cannot auto-join within a configurable timeout after removal

Once a node is removed, a timer is started. Until the timer expires, the node 
will not auto-join the voter set. 

The problem with this solution is that it’s hard to choose a “good timeout”. For 
example, with only 3 controller nodes, a 5 minute timer should be good enough. 
But with 10+ controller nodes, the pod rolling might take more than 10 minutes, 
which means that one or more of the most recently removed controllers could 
auto-join again once their timers expire. 
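
To make the timeout trade-off concrete, here is a minimal sketch of the timer 
rule. It is a toy model, not Kafka code: the function names, the assumption 
that the node is removed at t=0 when the roll starts, and the fixed per-pod 
roll time are all hypothetical.

```python
def auto_join_allowed(now_s, removed_at_s, timeout_s):
    """Solution 1 rule: auto-join stays blocked until the timer expires."""
    return now_s >= removed_at_s + timeout_s


def removed_node_rejoins_during_roll(num_controllers, per_node_roll_s, timeout_s):
    """True if the timer expires before a full rolling restart finishes.

    Assumptions (hypothetical): the node is removed at t=0, the roll starts
    at the same moment, and pods are rolled one at a time.
    """
    roll_done_at_s = num_controllers * per_node_roll_s
    # Once the timer has expired, any later moment inside the roll lets the
    # removed node auto-join again.
    return timeout_s < roll_done_at_s
```

With an assumed 70 seconds per pod, a 300 second timer covers a 3 node roll 
(210 seconds) but not a 12 node roll (840 seconds), so no single default fits 
both cluster sizes.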

2. A node is auto-joined only on node startup

The proposed solution resolves the problem as long as the node stays alive. It 
would work like this:
a. Once the controller is removed from the voter set, it won't be auto-joined 
even if `controller.quorum.auto.join.enable=true`
b. The controller can still be manually joined to the voter set in this state
c. The controller node will auto-join the voter set after the node is restarted.

To achieve this, we need to know whether “this node” is a voter at startup. If 
so, we can track whether the node later becomes an observer (leaves the voter 
set) and then suppress its auto-join. The problem is that there is no good way 
to get the voter set “at startup time”. The KRaft replication protocol is 
pull-based: the follower/observer’s job is to keep fetching from the leader 
node. When a node is lagging behind the leader after startup, its view of the 
voter set is stale. And while it is catching up with the leader, nothing tells 
it “offset xxx is the startup offset”, i.e. the state on which it should base 
the decision whether this node can auto-join. Because we cannot obtain the 
voter set state at startup, we cannot achieve the goal: a node is auto-joined 
only on node startup.

Another problem with this solution is that the state is not persistent, so 
once a node is restarted, it will auto-join the voters. But we cannot 
guarantee that a node never crashes. If the node is removed from the voters 
and then crashes and restarts before the operator deletes it, it will 
auto-join the voters again. This leaves a zombie voter in the metadata.
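
The persistence problem can be illustrated with a small toy model. This is 
only a sketch under stated assumptions: the `Controller` class and its methods 
are hypothetical and do not correspond to any Kafka class; the only point is 
that the "removed" flag lives in memory.

```python
class Controller:
    """Toy model of solution 2: the 'was removed' flag is kept in memory only."""

    def __init__(self, node_id, auto_join_enabled=True):
        self.node_id = node_id
        self.auto_join_enabled = auto_join_enabled
        self.removed_since_startup = False  # hypothetical flag, never persisted

    def on_removed_from_voters(self):
        self.removed_since_startup = True

    def restart(self):
        # A crash or restart wipes all in-memory state.
        self.removed_since_startup = False

    def should_auto_join(self, voters):
        return (self.auto_join_enabled
                and self.node_id not in voters
                and not self.removed_since_startup)


voters = {1, 2, 3}
c3 = Controller(3)

voters.discard(3)               # the operator removes controller 3
c3.on_removed_from_voters()
before_crash = c3.should_auto_join(voters)  # auto-join is suppressed

c3.restart()                    # crash before the operator deletes the node
after_crash = c3.should_auto_join(voters)   # flag is gone: zombie voter risk
```

Here `before_crash` is `False` and `after_crash` is `True`: the node that was 
deliberately removed asks to join again after an unplanned restart.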

3. To address the problem with solution (2), we change the semantics to: 

New controllers (node id + directory UUID tuples) will automatically join the 
KRaft voter set once, provided they have never been a voter before.

The improvement is that the check is done on the leader controller side. Doing 
the check on the leader means the data is persistent: we always rely on the 
latest metadata state to decide whether a node can auto-join. To do this, 
we just need to check the metadata history and find out whether the node has 
ever been in the voter set. 

The problem with this solution is that the metadata log is periodically 
snapshotted to save disk space, and a snapshot only contains “the latest 
metadata state at the time of snapshotting”. As a result, the “history of 
metadata” is no longer complete, and we can no longer check whether a node has 
ever been in the voter set.
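
The effect of snapshotting on the "has ever been a voter" check can be 
sketched as follows. This is a simplified model, not the real metadata log 
format: the log is represented as a plain list of voter sets, and the 
directory ids (`"dirA"` etc.) are made up for illustration.

```python
def ever_a_voter(voter_set_history, node):
    """True if node (node id, directory id) appears in any recorded voter set."""
    return any(node in voters for voters in voter_set_history)


def snapshot(voter_set_history):
    """Model of snapshotting: only the latest state survives."""
    return voter_set_history[-1:]


# Hypothetical history: controller 3 was a voter, then got removed.
history = [
    frozenset({(1, "dirA"), (2, "dirB"), (3, "dirC")}),
    frozenset({(1, "dirA"), (2, "dirB")}),
]

known_before_snapshot = ever_a_voter(history, (3, "dirC"))
known_after_snapshot = ever_a_voter(snapshot(history), (3, "dirC"))
```

Before the snapshot the leader can still tell that `(3, "dirC")` used to be a 
voter; afterwards the same check returns `False`, so the removed controller 
would look like a brand-new one.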


After a long discussion, we think that so far there is no good solution to 
this issue without a KIP. So users facing this issue might need to come up 
with a solution that does not rely on the auto-join feature to scale the voter 
set up/down, i.e. implementing controller addition and removal in the 
orchestration layer using Admin::addRaftVoter, Admin::removeRaftVoter and 
Admin::describeMetadataQuorum.
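
As a rough illustration of what such orchestration logic could look like, 
here is a minimal sketch. It does not call Kafka: `FakeQuorum` is a 
hypothetical in-memory stand-in whose three methods only mimic the roles of 
Admin::describeMetadataQuorum, Admin::addRaftVoter and Admin::removeRaftVoter, 
and it assumes auto-join is disabled so that a removed controller simply stays 
an observer.

```python
class FakeQuorum:
    """Hypothetical in-memory stand-in for the three Admin calls."""

    def __init__(self, voters, observers=()):
        self.voters = set(voters)
        self.observers = set(observers)

    def describe(self):            # role of Admin::describeMetadataQuorum
        return {"voters": set(self.voters), "observers": set(self.observers)}

    def add_voter(self, node):     # role of Admin::addRaftVoter
        self.observers.discard(node)
        self.voters.add(node)

    def remove_voter(self, node):  # role of Admin::removeRaftVoter
        self.voters.discard(node)
        self.observers.add(node)   # a running removed voter keeps fetching


def scale_up(quorum, node):
    """Promote node to voter; returns True only if this call changed the set."""
    state = quorum.describe()
    if node in state["voters"]:
        return False
    if node not in state["observers"]:
        raise RuntimeError(f"controller {node} is not fetching from the leader yet")
    quorum.add_voter(node)
    return node in quorum.describe()["voters"]


def scale_down(quorum, node):
    """Remove node from the voters; returns True only if it was a voter."""
    if node not in quorum.describe()["voters"]:
        return False
    quorum.remove_voter(node)
    return node not in quorum.describe()["voters"]
```

For example, `scale_down(FakeQuorum({1, 2, 3}), 3)` leaves voters `{1, 2}` 
with controller 3 as an observer, and nothing adds it back until the 
orchestration layer explicitly calls `scale_up`.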
 


> KRaft voter auto join will add a removed voter immediately
> ----------------------------------------------------------
>
>                 Key: KAFKA-19850
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19850
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 4.2.0
>            Reporter: Luke Chen
>            Priority: Major
>
> In v4.2.0, we are able to auto join a controller with the configuration 
> `controller.quorum.auto.join.enable=true` set 
> ([KIP-853|https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes#KIP853:KRaftControllerMembershipChanges-Controllerautojoining](KAFKA-19078)).
> This is a good improvement for controller addition, but it has a UX issue: 
> when a controller is removed via removeVoterRequest, it is added back 
> immediately due to `controller.quorum.auto.join.enable=true`. In the KIP, we 
> also mention that you have to stop the controller before removing it:
>  
> {noformat}
> controller.quorum.auto.join.enable:
> Controls whether a KRaft controller should automatically join the cluster 
> metadata partition for its cluster id. If the configuration is set to 
> true the controller must be stopped before removing the controller with 
> kafka-metadata-quorum remove-controller.{noformat}
>  
> This "shut down the to-be-removed controller first" operation might break the 
> quorum in the worst case. For example, take a quorum of 3 controller nodes 
> (C1, C2, C3) where C1 is the leader, C3 has already caught up with C1, and C2 
> is still catching up with the leader. When users want to remove C3, they 
> follow the guide and shut down C3 first. But at this point the quorum is 
> broken and the Kafka cluster is basically unavailable.
> Furthermore, this is not user-friendly behavior. It will confuse users and 
> make them think something is wrong with the controller removal. Besides, in 
> a Kubernetes environment controlled by an operator, it is not the 
> cloud-native way to shut down a node, do some operation, then start it up.
>  
> So I propose that we improve this by making sure "the removed controller 
> will not be auto-joined before it is restarted". That is:
> 1. Once the controller is removed from the voter set, it won't be auto-joined 
> even if `controller.quorum.auto.join.enable=true`
> 2. The controller can still be manually joined to the voter set in this state
> 3. The controller node will auto-join the voter set after the node is restarted.
>  
> So in short, the semantics of auto-join are updated to "a node is 
> auto-joined only on node startup". I think this makes more sense to users. 
> Thoughts?
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
