Thank you both for reviewing the KIP again. Addressing Colin's comments:
> The admin client changes don't seem to include a way to find out if a given broker is fenced, by examining the result. Seems like we need that as well? I think admins usually send describeCluster requests to find out about cluster's node ids and their descriptions. So I'm not sure if we really need the ability to describe a specific node as they would still need to send a request to find out the node id first, which could already include the fenced state. However, if you think it's useful to only describe a particular node, we can add that to the admin client changes. I guess we would need to add another field to the DescribeClusterOptions to specify a node id so that the response returns the description for that node only (ignoring rest of the nodes in the result returned from broker)? > You need to describe what happens when the client requests to see fenced brokers, but the protocol doesn't support that. I would suggest UnsupportedVersionException. With Jose's suggestion to remove the includeFencedBroker option, we might not need this exception anymore. Admin clients connecting to older brokers would simply not receive the fenced state of the nodes. Addressing Jose's comments: > Did you consider making "includeFencedBroker" field value implicitly based on the DescribeClusterRequest version and the EndpointType? For example, if the request version is 2 and the endpoint type is broker, the response will include all of the brokers and if they are fenced or unfenced. I am suggesting this because a request of endpointType controller and includeFencedBroker true doesn't make sense. In general, users of the AdminClient cannot easily discover the API versions supported by the destination node since the destination node may not be known when the admin call is made. This is a good point! I agree that endpointType controller and includeFencedBroker true doesn't make sense. I thought the includeFencedBroker option would be ignored since there is no fenced state in the ControllerRegistration. So I think what you are suggesting is that instead of having the includeFencedBroker option in the request, we would return all broker nodes and their description including the fenced state, if the request version is 2. I guess if endpointType is controller, controller nodes will be returned with a fenced state set to false (it is a boolean field in Node class). I'm happy to go with this approach because I think it is even simpler. > It is unfortunate that this RPC is handled using the metadata cache. That means that the user cannot find out when was the last time the broker contacted the controller cluster. Knowing that a broker is fenced is not enough information to determine to unregister a broker. If the user knows that the broker has been fenced for hours or day, it may be enough information to decide to unregister the cluster. I understand that we can't determine how long a broker node has been in a fenced state with the API. The motivation of this proposal is to help admins to find out which broker is fenced, when they are performing scale down operation. Nodes wouldn't usually get unregistered unless the cluster is being scaled down, however admin could also find out the id of a fenced broker using the DescribeCluster API and use it to check the metrics such as lastFetchTimeStamp to determine how longs it's been fenced. > For this command to work, the user must pass --bootstrap-server. If the user passes --bootstrap-controller, the controller nodes will get listed instead. Are you planning to also implement "kafka-cluster --bootstrap-controller <endpoints> list-controllers"? The other option is to make the command "list-nodes" and it will list the controllers or brokers based on the bootstrap server used. Or are you planning to fix the controller side and broker side implementation so that controller nodes can return registered brokers and broker nodes can return registered controllers? I was planning to limit the "list-brokers" command only to be used with --bootstrap-server. If the command was used with --bootstrap-controller, the user would get an error back. I should have made it clear in the KIP. However, thinking about this now, if we go ahead with the suggestion from your first point, I think we can change the command to "list-nodes" and output either controller nodes or broker nodes depending on the bootstrap argument. The output will have a fenced state printed only if --bootstrap-server is used. What do you think? > Is the tool going to print the string "→ (Broker 2 has shutdown but still registered)" to stdout? Sorry, that was meant to be just a comment clarifying the new row, which may not be necessary. I will remove it. Regards, Gantigmaa On Mon, Aug 26, 2024 at 3:19 PM José Armando García Sancio <jsan...@confluent.io.invalid> wrote: > Thanks for the KIP Gantigmaa, > > > { "name": "IncludeFencedBrokers", "type": "bool", "versions": "2+", > "about": "Whether to include fenced brokers." } > > Did you consider making "includeFencedBroker" field value implicitly > based on the DescribeClusterRequest version and the EndpointType? For > example, if the request version is 2 and the endpoint type is broker, > the response will include all of the brokers and if they are fenced or > unfenced. I am suggesting this because a request of endpointType > controller and includeFencedBroker true doesn't make sense. In > general, users of the AdminClient cannot easily discover the API > versions supported by the destination node since the destination node > may not be known when the admin call is made. > > > { "name": "Fenced", "type": "bool", "versions": "2+", "about": "Whether > the broker is fenced." } > > It is unfortunate that this RPC is handled using the metadata cache. > That means that the user cannot find out when was the last time the > broker contacted the controller cluster. Knowing that a broker is > fenced is not enough information to determine to unregister a broker. > If the user knows that the broker has been fenced for hours or day, it > may be enough information to decide to unregister the cluster. > > > kafka-cluster.sh --bootstrap-server localhost:9092 list-brokers > > For this command to work, the user must pass --bootstrap-server. If > the user passes --bootstrap-controller, the controller nodes will get > listed instead. Are you planning to also implement "kafka-cluster > --bootstrap-controller <endpoints> list-controllers"? The other option > is to make the command "list-nodes" and it will list the controllers > or brokers based on the bootstrap server used. Or are you planning to > fix the controller side and broker side implementation so that > controller nodes can return registered brokers and broker nodes can > return registered controllers? > > I guess the other option is to limit the kafka-cluster tool for > brokers and they must use the kafka-metadata-quorum tool for > controllers. What do you think? > > > 2 broker-2 9092 fenced → (Broker 2 has shutdown but > still registered) > > Is the tool going to print the string "→ (Broker 2 has shutdown but > still registered)" to stdout? > > Thanks, > -José > >