Hi Ismael and Jun,

Thanks for the KIP. It is very useful. A couple of comments:

q10. With the event-loop based controller design, it seems natural to have a
processing time metric for each controller event type. As it stands, the
current metrics do not seem to cover all of the event processing times, e.g.
the broker-join and broker-failure event handling times.
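
To illustrate what I have in mind (just a rough sketch, not a proposal for
the exact implementation): one Yammer timer per controller event type, so
that e.g. broker-join and broker-failure handling show up as separate
metrics. The class and metric names below are made up for illustration.

import com.yammer.metrics.Metrics;
import com.yammer.metrics.core.Timer;
import com.yammer.metrics.core.TimerContext;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: keeps one Yammer timer per controller event type.
public class ControllerEventTimers {
    private final ConcurrentMap<String, Timer> timers = new ConcurrentHashMap<>();

    private Timer timerFor(String eventType) {
        // e.g. "BrokerJoinProcessTimeMs", "BrokerFailureProcessTimeMs"
        return timers.computeIfAbsent(eventType, type ->
            Metrics.newTimer(ControllerEventTimers.class,
                type + "ProcessTimeMs", TimeUnit.MILLISECONDS, TimeUnit.SECONDS));
    }

    // Wraps the processing of a single controller event and records its duration.
    public void timeEvent(String eventType, Runnable process) {
        TimerContext ctx = timerFor(eventType).time();
        try {
            process.run();
        } finally {
            ctx.stop();
        }
    }
}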

q11. There are actually a few stages in controller state transition and
propagation. More specifically:
Stage 1: Event queue time
Stage 2: Event process time (including all the zk path updates)
Stage 3: State propagation time (including the propagation queuing time on
the controller and the time from request send to response receive)

I think it is worth having a metric for each of these stages. Stage 3 might
be a little tricky because we may need to track the propagation time per
broker. Arguably the state propagation time overlaps somewhat with the
broker-side request handling time, but for a state change that involves
multiple types of requests, it could still be useful to know how long it
takes for a state transition to be propagated to the entire cluster.
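
To make the stage boundaries concrete, here is a rough sketch (in Java, with
made-up field names) of the timestamps an event could carry so that the three
stage metrics can be derived. It is only meant to show where the measurement
points are, not how the controller should actually record them.

import java.util.concurrent.TimeUnit;

// Hypothetical event wrapper carrying the timestamps for the three stages.
public class TimedControllerEvent {
    final long enqueueTimeNs = System.nanoTime(); // set when the event is enqueued
    long dequeueTimeNs;      // set when the event loop picks the event up
    long processEndTimeNs;   // set after all ZK path updates are done
    long lastResponseTimeNs; // set when the last broker's response is received

    // Stage 1: event queue time
    long queueTimeMs()       { return TimeUnit.NANOSECONDS.toMillis(dequeueTimeNs - enqueueTimeNs); }
    // Stage 2: event process time
    long processTimeMs()     { return TimeUnit.NANOSECONDS.toMillis(processEndTimeNs - dequeueTimeNs); }
    // Stage 3: state propagation time
    long propagationTimeMs() { return TimeUnit.NANOSECONDS.toMillis(lastResponseTimeNs - processEndTimeNs); }
}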

q12. Regarding (13), the controller actually does not update the ISR but
just reads it. The error message seems to come from the brokers. It usually
happens in the following case:
1. The current leader broker has the ZNode version V cached.
2. The controller elects a new leader and updates the ZNode, so the ZNode
version becomes V+1.
3. The controller sends a LeaderAndIsrRequest to the replica brokers to
propagate the new leader information as well as the new zkVersion.
4. Before the old leader processes the LeaderAndIsrRequest from the
controller in step 3, it tries to update the ISR and finds that its cached
zkVersion V is different from the actual zkVersion V+1.
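
For clarity, the conditional update in step 4 looks roughly like the sketch
below, using the raw ZooKeeper client API for illustration (Kafka goes
through its own ZK utilities; the path, payload and method name here are
placeholders).

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class IsrUpdateSketch {
    // Returns true if the ISR update succeeded, false on a zkVersion conflict.
    static boolean tryUpdateIsr(ZooKeeper zk, String statePath, byte[] newState,
                                int cachedZkVersion) throws Exception {
        try {
            // Conditional write: only succeeds if the ZNode is still at the
            // version the old leader has cached (V in the example above).
            zk.setData(statePath, newState, cachedZkVersion);
            return true;
        } catch (KeeperException.BadVersionException e) {
            // The controller already bumped the ZNode to V+1 when it elected
            // the new leader, so the old leader's cached version is stale,
            // which is what the error message in (13) refers to.
            return false;
        }
    }
}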

It looks like this is not a controller metric. I tend to think that if we
have good metrics on the controller side covering end-to-end state
propagation (i.e. until the LeaderAndIsrResponse is received), we don't need
to worry about this intermittent ISR update failure. If it keeps happening,
we should see a long state propagation time reported by the controller, or a
long LeaderAndIsrRequest queue time.

Thanks,

Jiangjie (Becket) Qin

On Mon, May 8, 2017 at 2:47 PM, Joel Koshy <jjkosh...@gmail.com> wrote:

> Hi Ismael,
>
>
> > > What about a broker that is not the controller? Would you need a
> > > separate idle-not-controller state?
> >
> >
> > Do we need a separate state or can users just use the
> > ActiveControllerCount metric to check if the broker is the controller?
> >
>
> Sure - the ACC metric should be sufficient.
>
>
> > > Given that most of the state changes are short
> > > we would just see blips in the best case and nothing in the worst case
> > > (depending on how often metrics get sampled). It would only help if you
> > > want to visually detect any transitions that are taking an inordinate
> > > duration.
> > >
> >
> > Right. Do you think this is not worth it?
> >
>
> Not sure - I felt the RateAndTime sensors are sufficient and would give
> more intuitive results. E.g., users would emit the state metric as a gauge
> but the underlying metric reporter would need to sample the metric
> frequently enough to capture state changes. i.e., for most metrics backends
> you would likely see a flat line even through a series of (fast) state
> changes.
>
>
> >
> > > Ok - my thought was since we are already using kafka-metrics for quotas
> > > and selector metrics we could just do the same for this (and any *new*
> > > metrics on the broker).
> > >
> >
> > I don't have a strong opinion either way as I think both options have
> > pros and cons. It seems like we don't have the concept of a Gauge in Kafka
> > Metrics at the moment, so it seems like it would be easier to do it via
> > Yammer metrics while we discuss what's the best way to unify all the
> > metrics again.
> >
>
> Sounds good.
>
