Re: [DISCUSS] KIP-1073 Return inactive observer nodes in DescribeQuorum response

Gantigmaa Selenge Wed, 28 Aug 2024 03:34:09 -0700

Thank you both for reviewing the KIP again.

Addressing Colin's comments:

 > The admin client changes don't seem to include a way to find out if a
given broker is fenced, by examining the result. Seems like we need that as
well?

I think admins usually send describeCluster requests to find out about
cluster's node ids and their descriptions. So I'm not sure if we really
need the ability to describe a specific node as they would still need to
send a request to find out the node id first, which could already include
the fenced state. However, if you think it's useful to only describe a
particular node, we can add that to the admin client changes. I guess we
would need to add another field to the DescribeClusterOptions to specify a
node id so that the response returns the description for that node only
(ignoring rest of the nodes in the result returned from broker)?

> You need to describe what happens when the client requests to see fenced
brokers, but the protocol doesn't support that. I would suggest
UnsupportedVersionException.

With Jose's suggestion to remove the includeFencedBroker option, we might
not need this exception anymore. Admin clients connecting to older brokers
would simply not receive the fenced state of the nodes.

Addressing Jose's comments:

> Did you consider making "includeFencedBroker" field value implicitly
based on the DescribeClusterRequest version and the EndpointType? For
example, if the request version is 2 and the endpoint type is broker,
the response will include all of the brokers and if they are fenced or
unfenced. I am suggesting this because a request of endpointType
controller and includeFencedBroker true doesn't make sense. In
general, users of the AdminClient cannot easily discover the API
versions supported by the destination node since the destination node
may not be known when the admin call is made.

This is a good point! I agree that endpointType controller and
includeFencedBroker true doesn't make sense. I thought the
includeFencedBroker option would be ignored since there is no fenced state
in the ControllerRegistration. So I think what you are suggesting is that
instead of having the includeFencedBroker option in the request, we would
return all broker nodes and their description including the fenced state,
if the request version is 2. I guess if endpointType is controller,
controller nodes will be returned with a fenced state set to false (it is a
boolean field in Node class). I'm happy to go with this approach because I
think it is even simpler.

> It is unfortunate that this RPC is handled using the metadata cache.
That means that the user cannot find out when was the last time the
broker contacted the controller cluster. Knowing that a broker is
fenced is not enough information to determine to unregister a broker.
If the user knows that the broker has been fenced for hours or day, it
may be enough information to decide to unregister the cluster.

I understand that we can't determine how long a broker node has been in a
fenced state with the API. The motivation of this proposal is to help
admins to find out which broker is fenced, when they are performing scale
down operation. Nodes wouldn't usually get unregistered unless the
cluster is being scaled down, however admin could also find out the id of a
fenced broker using the DescribeCluster API and use it to check the metrics
such as lastFetchTimeStamp to determine how longs it's been fenced.

> For this command to work, the user must pass --bootstrap-server. If
the user passes --bootstrap-controller, the controller nodes will get
listed instead. Are you planning to also implement "kafka-cluster
--bootstrap-controller <endpoints> list-controllers"? The other option
is to make the command "list-nodes" and it will list the controllers
or brokers based on the bootstrap server used. Or are you planning to
fix the controller side and broker side implementation so that
controller nodes can return registered brokers and broker nodes can
return registered controllers?

I was planning to limit the "list-brokers" command only to be used with
--bootstrap-server. If the command was used with --bootstrap-controller,
the user would get an error back. I should have made it clear in the KIP.
However, thinking about this now, if we go ahead with the suggestion from
your first point, I think we can change the command to "list-nodes" and
output either controller nodes or broker nodes depending on the bootstrap
argument. The output will have a fenced state printed only if
--bootstrap-server is used. What do you think?

> Is the tool going to print the string "→ (Broker 2 has shutdown but
still registered)" to stdout?

Sorry, that was meant to be just a comment clarifying the new row, which
may not be necessary. I will remove it.

Regards,
Gantigmaa

On Mon, Aug 26, 2024 at 3:19 PM José Armando García Sancio
<jsan...@confluent.io.invalid> wrote:

> Thanks for the KIP Gantigmaa,
>
> > { "name": "IncludeFencedBrokers", "type": "bool", "versions": "2+",
> "about": "Whether to include fenced brokers." }
>
> Did you consider making "includeFencedBroker" field value implicitly
> based on the DescribeClusterRequest version and the EndpointType? For
> example, if the request version is 2 and the endpoint type is broker,
> the response will include all of the brokers and if they are fenced or
> unfenced. I am suggesting this because a request of endpointType
> controller and includeFencedBroker true doesn't make sense. In
> general, users of the AdminClient cannot easily discover the API
> versions supported by the destination node since the destination node
> may not be known when the admin call is made.
>
> > { "name": "Fenced", "type": "bool", "versions": "2+", "about": "Whether
> the broker is fenced." }
>
> It is unfortunate that this RPC is handled using the metadata cache.
> That means that the user cannot find out when was the last time the
> broker contacted the controller cluster. Knowing that a broker is
> fenced is not enough information to determine to unregister a broker.
> If the user knows that the broker has been fenced for hours or day, it
> may be enough information to decide to unregister the cluster.
>
> > kafka-cluster.sh --bootstrap-server localhost:9092 list-brokers
>
> For this command to work, the user must pass --bootstrap-server. If
> the user passes --bootstrap-controller, the controller nodes will get
> listed instead. Are you planning to also implement "kafka-cluster
> --bootstrap-controller <endpoints> list-controllers"? The other option
> is to make the command "list-nodes" and it will list the controllers
> or brokers based on the bootstrap server used. Or are you planning to
> fix the controller side and broker side implementation so that
> controller nodes can return registered brokers and broker nodes can
> return registered controllers?
>
> I guess the other option is to limit the kafka-cluster tool for
> brokers and they must use the kafka-metadata-quorum tool for
> controllers. What do you think?
>
> > 2          broker-2  9092       fenced   → (Broker 2 has shutdown but
> still registered)
>
> Is the tool going to print the string "→ (Broker 2 has shutdown but
> still registered)" to stdout?
>
> Thanks,
> -José
>
>

Re: [DISCUSS] KIP-1073 Return inactive observer nodes in DescribeQuorum response

Reply via email to