I had a comment similar to Becket's but accidentally posted it on the vote thread last Friday. From that thread: "I noticed that both the ControllerState metric and the *RateAndTimeMs metrics only cover a subset of the controller event types. Was this intentional?"
I think it makes most sense to just have the ControllerState metric and the *RateAndTimeMs metrics exactly match up with the event types.

On Mon, May 8, 2017 at 4:35 PM, Ismael Juma <ism...@juma.me.uk> wrote:

> Hi Becket,
>
> Thanks for the feedback. Comments inline.
>
> On Tue, May 9, 2017 at 12:19 AM, Becket Qin <becket....@gmail.com> wrote:
>
>> q10. With the event-loop-based controller design, it seems natural to have
>> the processing time for each controller event type. In that case, the
>> current metrics don't seem to cover all the event processing times, e.g.
>> broker-join and broker-failure event handling time.
>
> I'll leave it to Jun to explain why some metrics were not included in the
> proposal.
>
>> q11. There are actually a couple of stages for controller state transition
>> and propagation. More specifically:
>> Stage 1: Event queue time
>> Stage 2: Event process time (including all the ZK path updates)
>> Stage 3: State propagation time (including the state propagation queuing
>> time on the controller and the time from request send to response receipt)
>>
>> I think it's worth having metrics for each of these stages. Stage 3 might
>> be a little tricky because we may need to manage the per-broker propagation
>> time. Arguably, the state propagation time overlaps somewhat with the
>> broker-side request handling time, but for a state change that involves
>> multiple types of requests, it could still be useful to know how long it
>> takes for a state transition to be propagated to the entire cluster.
>
> Can you please elaborate on how you would like to see this measured? Do
> you mean that each event would have a separate metric for each of these
> stages? Maybe a worked-out example would make it clear.
>
>> q12. Regarding (13), the controller actually does not update the ISR but
>> just reads it. The error message seems to come from the brokers. It
>> usually happens in the following case:
>> 1. The current leader broker has the cached ZNode version V.
>> 2. The controller elects a new leader and updates the ZNode, so the ZNode
>> version becomes V+1.
>> 3. The controller sends the LeaderAndIsrRequest to the replica brokers to
>> propagate the new leader information as well as the new zkVersion.
>> 4. Before the old leader processes the LeaderAndIsrRequest from the
>> controller in step 3, it tries to update the ISR and finds that the cached
>> zkVersion V is different from the actual zkVersion V+1.
>>
>> It looks like this is not a controller metric.
>
> Yes, it's not a controller metric, which is why it's under the "Partition
> Metrics" section. In the PR, I actually implemented it in ReplicaManager
> alongside IsrExpandsPerSec and IsrShrinksPerSec.
>
> Thanks,
> Ismael
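
To make the point about matching metrics to event types concrete, here is a minimal sketch (not from the KIP or the PR; the event, class, and metric names below are made up for illustration) of how stage 1 (queue time) and stage 2 (process time) could be recorded per controller event type around a queue-based event loop:

import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

// Hypothetical event types; the real controller defines its own set of events.
sealed trait ControllerEvent {
  val enqueueTimeNs: Long = System.nanoTime()
}
case class BrokerChange(brokerIds: Set[Int]) extends ControllerEvent
case class TopicChange(topics: Set[String]) extends ControllerEvent

// Stand-in for per-event-type timers (in Kafka these would be Yammer timers,
// e.g. one <EventType>QueueTimeMs and one <EventType>RateAndTimeMs per event).
class EventTimings {
  def recordQueueTimeMs(eventName: String, ms: Long): Unit =
    println(s"${eventName}QueueTimeMs: $ms")
  def recordProcessTimeMs(eventName: String, ms: Long): Unit =
    println(s"${eventName}RateAndTimeMs: $ms")
}

class ControllerEventLoop(timings: EventTimings) extends Runnable {
  private val queue = new LinkedBlockingQueue[ControllerEvent]()

  def put(event: ControllerEvent): Unit = queue.put(event)

  override def run(): Unit = {
    try {
      while (true) {
        val event = queue.take()
        val eventName = event.getClass.getSimpleName
        val dequeueNs = System.nanoTime()
        // Stage 1: time the event spent waiting in the queue.
        timings.recordQueueTimeMs(
          eventName, TimeUnit.NANOSECONDS.toMillis(dequeueNs - event.enqueueTimeNs))
        process(event)
        // Stage 2: time spent processing the event (ZK reads/writes happen inside).
        timings.recordProcessTimeMs(
          eventName, TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - dequeueNs))
      }
    } catch {
      case _: InterruptedException => // shutdown
    }
  }

  private def process(event: ControllerEvent): Unit = {
    // Real processing (leader election, ZK updates, request batching) goes here.
  }
}

Stage 3 (propagation time) would presumably need a send timestamp attached to each outgoing request and recorded when the corresponding broker response arrives, which is why the per-broker bookkeeping Becket mentions is trickier than the two stages shown above.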