Hey Jose, Yeah, that was an initial discussion point that isn't going to be implemented. I'll move it to "rejected alternatives" and remove the "proposed changes" section. Thanks for the feedback.
Best, Kevin On Mon, Apr 14, 2025 at 4:31 PM Kevin Wu <kevin.wu2...@gmail.com> wrote: > Hey Colin, > > > How about something like this? > 10 = fenced > 20 = controlled shutdown > > 30 = active > > Yeah, that seems reasonable to me. Thanks for the suggestion. > > Kevin > > > > On Mon, Apr 14, 2025 at 12:42 PM Kevin Wu <kevin.wu2...@gmail.com> wrote: > >> Thanks for the comments Federico. >> >> > If I understand correctly unfenced == active. In the code we always >> > use the term active, so I think it would be better to use that for the >> > state 0 description. >> I've updated the KIP description to refer to "active". >> >> > You propose creating per-broker metrics indicating their state > >> (BrokerRegistrationState.kafka-X). Can't these new metrics be used to > >> derive broker counters in whatever monitor tool you decide to use? I > >> mean, we wouldn't need to store and provide > ControlledShutdownBrokerCount >> (proposed), FencedBrokerCount > (existing), ActiveBrokerCount (existing). >> Yes, we can use this new metric to derive broker counters, but it's just >> more complicated for the operator to implement. Also, I don't think it's a >> huge issue that there's a slight redundancy here, since deleting the >> existing metrics will lead to compatibility issues with current monitoring >> setups. >> >> On Mon, Apr 14, 2025 at 12:25 PM Kevin Wu <kevin.wu2...@gmail.com> wrote: >> >>> Thanks for the comments Jose. >>> For 1 and 2, I've changed the naming of the metrics to follow your >>> suggestion of tags/attributes. For 3, I made a note as to why we need the >>> maximum. Basically, it's because the map that contains broker contact times >>> we're using as the source for these metrics removes entries when a broker >>> is fenced. Therefore, we need some default value when the entry doesn't >>> exist for the broker, but it is still registered. >>> >>> Thanks, >>> Kevin >>> >>> > Thanks for the improvement Kevin. I got a chance to look at the KIP. >>> > >>> > 1. >>> kafka.controller:type=KafkaController,name=BrokerRegistrationState.kafka-X >>> > >>> > Can we use tags or attributes instead of different names? For example, >>> > how about >>> kafka.controller:type=KafkaController,name=BrokerRegistrationState,broker=X >>> > where X is the node id? >>> > >>> > 2. >>> kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs.kafka-X >>> > >>> > Same here, did you consider using tags or attributes for the node id? >>> > >>> > 3. For the metrics >>> > >>> kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs.kafka-X, >>> > you mentioned that you will limit the value to the heartbeat timeout. >>> > Why? Wouldn't it be a useful report the entire time since the last >>> > heartbeat? That is more information instead of just reporting the >>> > value up to the heartbeat timeout. >>> > >>> > Thanks, >>> > -- >>> > -José >>> > >>> >>> On Thu, Mar 6, 2025 at 1:58 PM Kevin Wu <kevin.wu2...@gmail.com> wrote: >>> >>>> That's an interesting idea. However, I think that's going to be messy >>>>> and difficult for people to use. For example, how would you set up Grafana >>>>> or Datadog to use this? The string could also get extremely long (imagine >>>>> 1000 brokers all in startup.) >>>> >>>> Hmm... Yeah from what I've read so far setting this up might be kind of >>>> challenging. I'm not seeing that OTEL supports gauges for string values. >>>> >>>> I'm still a little confused as to why having a per-broker metric to >>>> expose its state is preferred, but I think this is at least part of the >>>> reason? When drafting this KIP, I was only really considering the scenarios >>>> of the broker's initial metadata load during startup and their controlled >>>> shutdown, which my proposed metrics would cover. However, there are a lot >>>> of other scenarios with fenced brokers which have already started up that >>>> the existing fencedBrokers metric doesn't really give enough information >>>> about from the controller-side, since it just reports the number. For these >>>> scenarios, I don't think my proposed startup/shutdown focused metrics would >>>> be very useful. >>>> I'm on board with the proposed per-broker metric that exposes its >>>> state. I think it would be helpful to enumerate some specific cases though >>>> for the KIP. >>>> >>>> On Thu, Feb 27, 2025 at 2:19 PM Kevin Wu <kevin.wu2...@gmail.com> >>>> wrote: >>>> >>>>> I guess my concern is that the time-based metrics would reset to 0 on >>>>>> every failover (if I understand the proposed implementation correctly). >>>>>> That seems likely to create confusion. >>>>> >>>>> Yeah that makes sense to me. I'm fine with moving towards the approach >>>>> of either (since I don't think we need both): >>>>> >>>>> - Exposing the number of brokers in 1. startup, 2. fenced (what we >>>>> have now), and 3. in controlled shutdown >>>>> - Exposing a per-broker metric reflecting the state of the broker >>>>> (more on this below). >>>>> >>>>> I think it would be useful to have a state for each broker exposed as >>>>>> a metric. I can think of a lot of scenarios where this would be useful to >>>>>> have. I don't think we should have more than one metric per broker >>>>>> though, >>>>>> if we can help it. >>>>> >>>>> Instead of having exactly a per-broker metric which exposes a number >>>>> that maps to a state (0, 1, 2, and 3), what if we expose 4 metrics whose >>>>> values are a comma-delimited string of the brokers in those states. >>>>> Something along the lines of: >>>>> >>>>> - Metric: name = BrokersNotRegistered, value = "kafka-1" >>>>> - Metric: name = BrokersRegisteredAndNeverUnfenced, value = >>>>> "kafka-2" >>>>> - Metric: name = BrokersRegisteredAndFenced, value = >>>>> "kafka-2,kafka-3" >>>>> - Metric: name = BrokersRegisteredRegisteredAndUnfenced, value = >>>>> "kafka-4,kafka-5" >>>>> >>>>> I guess there will be overlap between the second and third metrics, >>>>> but there do exist metrics that expose `Gauge<String>` values. >>>>> >>>>> On Tue, Feb 25, 2025 at 4:12 PM Kevin Wu <kevin.wu2...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hey Colin, >>>>>> >>>>>> Thanks for the review. >>>>>> >>>>>> Regarding the metrics that reflect times: my initial thinking was to >>>>>> indeed have these be "soft state", which would be reset when a controller >>>>>> failover happens. I'm not sure if it's a big issue if these values get >>>>>> reset though, since a controller failover means brokers in startup would >>>>>> need to register again to the new controller anyways. Since what we're >>>>>> trying to monitor with these metrics is the broker's startup and shutdown >>>>>> statuses from the controller's view, my thinking was that exposing this >>>>>> soft state would be appropriate. >>>>>> >>>>>> There exist metrics that expose other soft state of the controller in >>>>>> `QuorumControllerMetrics.java`, and I thought the proposed metrics here >>>>>> would fit with what exists there. If instead we're updating these metrics >>>>>> based on the metadata log events for registration changes, it looks like >>>>>> `ControllerMetadataMetrics` has a `FencedBrokerCount` metric, and I guess >>>>>> we could add a `ControlledShutdownBrokerCount`. For specifically tracking >>>>>> brokers in their initial startup fencing using the log events, I'm not >>>>>> totally sure as of now how we can actually do this from only the >>>>>> information in `BrokerRegistration`. I guess we know a broker is >>>>>> undergoing >>>>>> startup when it's fenced and has an `incarnationId` the controller hasn't >>>>>> seen before in the log? >>>>>> >>>>>> Regarding the per-broker metrics, what are your thoughts about the >>>>>> metric cardinality of this? There was some discussion about having a >>>>>> startup/shutdown time per-broker and I pushed back against it because the >>>>>> number of metrics we expose as a result is the number of brokers in the >>>>>> cluster. Additionally, I don't think the controller can know of a live >>>>>> broker that has not attempted to register yet in order to make a metric >>>>>> for >>>>>> it and assign it a value of 0. Is a value of 0 for brokers that shutdown? >>>>>> In that case, doesn't that make the metric cardinality worse? I think if >>>>>> we >>>>>> decide to go that route we should only have states 1, 2, and 3. >>>>>> >>>>>> Best, >>>>>> Kevin Wu >>>>>> >>>>>> On Mon, Jan 27, 2025 at 12:56 PM Kevin Wu <kevin.wu2...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Hey all, >>>>>>> >>>>>>> I posted a KIP to monitor broker startup and controlled shutdown on >>>>>>> the controller-side. Here's the link: >>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup >>>>>>> >>>>>>> Best, >>>>>>> Kevin Wu >>>>>>> >>>>>>