Re: [DISCUSS] KIP-143: Controller Health Metrics

Ismael Juma Mon, 08 May 2017 16:36:59 -0700

Hi Becket,

Thanks for the feedback. Comments inline.


On Tue, May 9, 2017 at 12:19 AM, Becket Qin <[email protected]> wrote:
>
> q10. With event loop based controller design, it seems natrural to have the
> processing time for each controller event type. In that case, the current
> metrics seem not covering all the event processing time? e.g. (broker
> joining and broker failure event handling time).
>

I'll leave it to Jun to explain why some metrics were not included in the
proposal.

q11. There are actually a couple of stages for controller state transition
> and propagation. More specifically:
> Stage 1: Event queue time
> Stage 2: Event process time (including all the zk path updates)
> Stage 3: State propagation time (including the state propagation queuing
> time on the controller and the actual request sent to response receive
> time)
>
> I think it worth to have metrics for each of the stage. Stage 3 might be a
> little tricky because we may need to potentially manage the per broker
> propagation time. Arguably the state propagation time is a little
> overlapping with the broker side request handling time, but for some state
> change that involves multiple types of requests, it could still be useful
> to know what is the time for a state transition to be propagated to the
> entire cluster.
>

Can you please elaborate on how you would like to see this measured. Do you
mean that each event would have a separate metric for each of these stages?
Maybe a worked out example would make it clear.

q11. Regarding (13), controller actually do not update the ISR but just
> read it. The error message seems from the brokers. It usually happens in
> the following case:
> 1. Current leader broker has the cached ZNode version V
> 2. The controller elected a new leader and updated the ZNode, now ZNode
> version becomes V+1,
> 3. Controller sends the LeaderAndIsrRequest to the replica brokers to
> propagate the new leader information as well as the new zkVersion.
> 4. Before the old leader process the LeaderAndIsrRequest from controller in
> step 3, The old leader tries to update ISR and found the cached zkVersion V
> is different from the actual zkVersion V+1.
>
> It looks that this is not a controller metric.


Yes, it's not a controller metric, that's why it's under the "Partition
Metrics" section. In the PR, I actually implemented it in ReplicaManager
alongside IsrExpandsPerSec and IsrShrinksPerSec.

Thanks,
Ismael

Re: [DISCUSS] KIP-143: Controller Health Metrics

Reply via email to