Hi, Joel,

1. Yes, the long-term goal is to migrate the metrics on the broker to
kafka-metrics. Since many people are using Yammer reporters, we probably
need to support a few popular ones in kafka-metrics before migrating. Until
that happens, we probably want to stick with the Yammer metrics on the
server side unless we depend on features from kafka-metrics (e.g., quotas).

2. Thanks to Onur's work, we have now moved to a single-threaded model, so
none of the events in metric #1 will overlap.

3. Metric #6 just tracks the rate/latency each time the controller is called
to initiate or resume the processing of a reassignment request.
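
To make it concrete, roughly something like the following (just a sketch,
assuming the KafkaMetricsGroup helpers the broker already uses; the metric
name and method signature are illustrative, not the final ones):

    import java.util.concurrent.TimeUnit

    // inside KafkaController, which already extends KafkaMetricsGroup
    private val partitionReassignmentRateAndTimeMs =
      newTimer("PartitionReassignmentRateAndTimeMs", TimeUnit.MILLISECONDS, TimeUnit.SECONDS)

    def onPartitionReassignment(/* ... */): Unit = {
      val startNs = System.nanoTime()
      try {
        // initiate or resume the processing of the reassignment request
      } finally {
        // records both the rate and the latency of each invocation
        partitionReassignmentRateAndTimeMs.update(System.nanoTime() - startNs, TimeUnit.NANOSECONDS)
      }
    }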

4. Metrics #2 and #3. The issue with relying on metric #1 alone is that it
is sensitive to the frequency of metric collection. For example, if the
startup of the controller takes 30 secs and the metric is only collected
once a minute, one may not know the latency with just metric #1, but will
know it with metrics #2 and #3. Are you concerned about the memory overhead
of histograms? It doesn't seem that a couple more histograms will hurt.
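
As a sketch of what I have in mind (again assuming the KafkaMetricsGroup
helpers; the names here are hypothetical), a histogram updated once per
transition keeps the max/p99 even if the reporter only polls once a minute:

    private val controllerStartupTimeMs = newHistogram("ControllerStartupTimeMs")

    def onControllerFailover(): Unit = {
      val startMs = System.currentTimeMillis()
      // ... become the active controller: register listeners, initialize state, etc. ...
      // one update per transition, independent of how often the metric is collected
      controllerStartupTimeMs.update(System.currentTimeMillis() - startMs)
    }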

5. Metric #9. Agreed. After KAFKA-5028, this will be reflected in the
remoteTimeMs of the controlled shutdown request.

6. Metrics #4 and #5. They are actually a bit different. The local time of
createTopic/deleteTopic just includes the time to add/delete the topic path
in ZK. The remote time includes the time that the controller processes the
request plus the time for the metadata to be propagated back to the
controller. So, knowing just the portion of the time spent in the
controller can still be useful.

Hi, Ismael,

Thanks for the proposal. A couple more comments.

10. It would be useful to add a new metric for the controller queue size.
kafka.controller:type=ControllerStats,name=QueueSize
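
(Sketch, assuming the single-threaded event queue introduced by KAFKA-5028
is accessible from the controller; just to show the shape of the gauge.)

    import com.yammer.metrics.core.Gauge

    newGauge("QueueSize", new Gauge[Int] {
      // number of controller events currently waiting to be processed
      def value: Int = eventQueue.size
    })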

11. It would also be useful to know how long an event waits in the
controller queue before being processed. Perhaps we can add a histogram
metric like the following.
kafka.controller:type=ControllerStats,name=QueueTimeMs
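
(Sketch, assuming each event is stamped when it is enqueued; the
QueuedEvent wrapper and process() call are hypothetical.)

    private val queueTimeMs = newHistogram("QueueTimeMs")

    case class QueuedEvent(event: ControllerEvent, enqueueTimeMs: Long)

    // producer side: stamp the event when it is put on the queue
    queue.put(QueuedEvent(event, System.currentTimeMillis()))

    // consumer side: record how long it sat in the queue before processing
    val queued = queue.take()
    queueTimeMs.update(System.currentTimeMillis() - queued.enqueueTimeMs)
    process(queued.event)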

Jun

On Thu, Apr 27, 2017 at 11:39 AM, Joel Koshy <jjkosh...@gmail.com> wrote:

> Thanks for the KIP - couple of comments:
> - Do you intend to actually use yammer metrics? or use kafka-metrics and
> split the timer into an explicit rate and time? I think long term we ought
> to move off yammer and use kafka-metrics only. Actually either is fine, but
> we should ideally use only one in the long term - and I thought the plan
> was to use kafka-metrics.
> - metric #9 appears to be redundant since we already have per-API request
> rate and time metrics.
> - Same for metric #4, #5 (as there are request stats for
> DeleteTopicRequest - although it is possible for users to trigger deletes
> via ZK)
> - metric #2, #3 are potentially useful, but a bit overkill for a
> histogram. Alternative is to stick to last known value, but that doesn't
> play well with alerts if a high value isn't reset/decayed. Perhaps metric
> #1 would be sufficient to gauge slow start/resignation transitions.
> - metric #1 - some of the states may actually overlap
> - I don't actually understand the semantics of metric #6. Is it rate of
> partition reassignment triggers? does the number of partitions matter?
>
> Joel
>
> On Thu, Apr 27, 2017 at 8:04 AM, Tom Crayford <tcrayf...@heroku.com>
> wrote:
>
>> Ismael,
>>
>> Great, that sounds lovely.
>>
>> I'd like a `Timer` (using yammer metrics parlance) over how long it took
>> to
>> process the event, so we can get at p99 and max times spent processing
>> things. Maybe we could even do a log at warning level if event processing
>> takes over some timeout?
>>
>> Thanks
>>
>> Tom
>>
>> On Thu, Apr 27, 2017 at 3:59 PM, Ismael Juma <ism...@juma.me.uk> wrote:
>>
>> > Hi Tom,
>> >
>> > Yes, the plan is to merge KAFKA-5028 first and then use a lock-free
>> > approach for the new  metrics. I considered mentioning that in the KIP
>> > given KAFKA-5120, but didn't in the end. I'll add it to make it clear.
>> >
>> > Regarding locks, they are removed by KAFKA-5028, as you say. So, if I
>> > understand correctly, you are suggesting an event processing rate metric
>> > with event type as a tag? Onur and Jun, what do you think?
>> >
>> > Ismael
>> >
>> > On Thu, Apr 27, 2017 at 3:47 PM, Tom Crayford <tcrayf...@heroku.com>
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > We (Heroku) are very excited about this KIP, as we've struggled a bit
>> > with
>> > > controller stability recently. Having these additional metrics would
>> be
>> > > wonderful.
>> > >
>> > > I'd like to ensure polling these metrics *doesn't* hold any locks etc,
>> > > because, as noted in https://issues.apache.org/jira/browse/KAFKA-5120
>> ,
>> > > that
>> > > lock can be held for quite some time. This may become not an issue as
>> of
>> > > KAFKA-5028 though.
>> > >
>> > > Lastly, I'd love to see some metrics around how long the controller
>> > spends
>> > > inside its lock. We've been tracking an issue (
>> > > https://issues.apache.org/jira/browse/KAFKA-5116) where it can hold
>> the
>> > > lock for many, many minutes in a zk client listener thread when
>> > responding
>> > > to a single request. I'm not sure how that plays into
>> > > https://issues.apache.org/jira/browse/KAFKA-5028 (which I assume will
>> > land
>> > > before this metrics patch), but it feels like there will be equivalent
>> > > problems ("how long does it spend processing any individual message
>> from
>> > > the queue, broken down by message type").
>> > >
>> > > These are minor improvements though, the addition of more metrics to
>> the
>> > > controller is already going to be very helpful.
>> > >
>> > > Thanks
>> > >
>> > > Tom Crayford
>> > > Heroku Kafka
>> > >
>> > > On Thu, Apr 27, 2017 at 3:10 PM, Ismael Juma <ism...@juma.me.uk>
>> wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > We've posted "KIP-143: Controller Health Metrics" for discussion:
>> > > >
>> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>> > > > 143%3A+Controller+Health+Metrics
>> > > >
>> > > > Please take a look. Your feedback is appreciated.
>> > > >
>> > > > Thanks,
>> > > > Ismael
>> > > >
>> > >
>> >
>>
>
>
