Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

Rajini Sivaram Thu, 31 Aug 2017 13:57:50 -0700

Hi Jun,

Thank you, that makes sense. Updated the KIP.


Regards,

Rajini

On Thu, Aug 31, 2017 at 12:35 PM, Jun Rao <[email protected]> wrote:

> Hi, Rajini,
>
> Thanks for the updated KIP. Should be also track the conversion time for
> the producer? If so, the name can just be MessageConversionTimeMs and only
> the produce and the fetch request may have such a component.
>
> Jun
>
> On Wed, Aug 30, 2017 at 1:37 PM, Rajini Sivaram <[email protected]>
> wrote:
>
> > Jun,
> >
> > Thank you, your suggestions sound good. I have updated the KIP.
> >
> > Regards,
> >
> > Rajini
> >
> > On Tue, Aug 29, 2017 at 9:12 PM, Jun Rao <[email protected]> wrote:
> >
> > > Hi, Rajini,
> > >
> > > Thanks for the updated KIP. I agree that those additional metrics can
> be
> > > useful. I was thinking what would an admin do if the value of one of
> > > those metrics is abnormal. An admin probably want to determine which
> > client
> > > causes the abnormaly. So,  a couple of more comments below.
> > >
> > > 10. About FetchDownConversionsMs. Should we model it at the topic level
> > or
> > > at the request level as a new latency component like
> > messageConversionTime
> > > in addition to the existing localTime, requestQueueTime, etc? The
> benefit
> > > of the latter is that we can include it in the request log and use it
> to
> > > figure out which client is doing the conversion. Also, should we also
> > track
> > > the conversion time in the producer?
> > >
> > > 11. About ProduceBatchSize. Currently, the largest chunk of memory is
> > > allocated at the request level (mostly produce request). So, instead
> > > of ProduceBatchSize
> > > , perhaps we can add a request size metric for each type of request and
> > > also include it in the request log. This way, if there is a memory
> issue,
> > > we can trace it back to a particular client. Similarly, for
> > > ProduceUncompressedBatchSize,
> > > would it be better to track it at the request level as something like
> > > temporaryMemorySize and include it in the request log?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Aug 29, 2017 at 10:07 AM, Roger Hoover <[email protected]
> >
> > > wrote:
> > >
> > > > Great suggestions, Ismael and thanks for incorporating them, Rajini.
> > > >
> > > > Tracking authentication success and failures (#3) across listeners
> > seems
> > > > very useful for cluster administrators to identify misconfigured
> client
> > > or
> > > > bad actors, especially until all clients implement KIP-152 which will
> > add
> > > > an explicit error code for authentication failures.  Currently,
> clients
> > > > just get disconnected so it's hard to distinguish authentication
> > failures
> > > > from any other error that can cause disconnect.  This broker-side
> > metric
> > > is
> > > > useful regardless but can help fill this gap until all clients
> support
> > > KIP
> > > > 152.
> > > >
> > > > Just to be clear, the ones called `successful-authentication-rate`
> and
> > > > `failed-authentication-rate` will also have
> failed-authentication-count
> > > > and successful-authentication-count to match KIP 187?
> > > >
> > > > On Tue, Aug 29, 2017 at 7:26 AM, Rajini Sivaram <
> > [email protected]
> > > >
> > > > wrote:
> > > >
> > > > > Hi Ismael,
> > > > >
> > > > > Thank you for the suggestions. The additional metrics sound very
> > useful
> > > > and
> > > > > I have added them to the KIP.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Rajini
> > > > >
> > > > > On Tue, Aug 29, 2017 at 5:34 AM, Ismael Juma <[email protected]>
> > > wrote:
> > > > >
> > > > > > Hi Rajini,
> > > > > >
> > > > > > There are a few other metrics that could potentially be useful.
> I'd
> > > be
> > > > > > interested in what you and the community thinks:
> > > > > >
> > > > > > 1. The KIP currently includes `FetchDownConversionsPerSec`, which
> > is
> > > > > > useful. In the common case, one would want to avoid down
> conversion
> > > by
> > > > > > using the lower message format supported by most of the
> consumers.
> > > > > However,
> > > > > > there are good reasons to use a newer message format even if
> there
> > > are
> > > > > some
> > > > > > legacy consumers around. It would be good to quantify the cost of
> > > these
> > > > > > consumers a bit more clearly. Looking at the request metric
> > > > `LocalTimeMs`
> > > > > > provides a hint, but it may be useful to have a dedicated
> > > > > > `FetchDownConversionsMs` metric.
> > > > > >
> > > > > > 2. Large messages can cause GC issues (it's particularly bad if
> > fetch
> > > > > down
> > > > > > conversion takes place). One can currently configure the max
> > message
> > > > > batch
> > > > > > size per topic to keep this under control, but that is the size
> > after
> > > > > > compression. However, we decompress the batch to validate produce
> > > > > requests
> > > > > > and we decompress and recompress during fetch downconversion). It
> > > would
> > > > > be
> > > > > > helpful to have topic metrics for the produce message batch size
> > > after
> > > > > > decompression (and perhaps compressed as well as that would help
> > > > > understand
> > > > > > the compression ratio).
> > > > > >
> > > > > > 3. Authentication success/failures per second. This is helpful to
> > > > > > understand if some clients are misconfigured or if bad actors are
> > > > trying
> > > > > to
> > > > > > authenticate.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > > > Ismael
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Aug 23, 2017 at 2:53 AM, Jun Rao <[email protected]>
> wrote:
> > > > > >
> > > > > > > Hi, Rajini,
> > > > > > >
> > > > > > > Yes, if those error metrics are registered dynamically, we
> could
> > > > worry
> > > > > > > about expiration later.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Fri, Aug 18, 2017 at 1:55 AM, Rajini Sivaram <
> > > > > [email protected]
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Perhaps we could register dynamically for now. If we find
> that
> > > the
> > > > > cost
> > > > > > > of
> > > > > > > > retaining these is high, we can add the code to expire them
> > > later.
> > > > Is
> > > > > > > that
> > > > > > > > ok?
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Rajini
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Aug 18, 2017 at 9:41 AM, Ismael Juma <
> > [email protected]>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Can we quantify the cost of having these metrics around if
> > they
> > > > are
> > > > > > > > > dynamically registered? Given that the maximum is bounded
> at
> > > > > > > development
> > > > > > > > > time, is it really worth all the extra code?
> > > > > > > > >
> > > > > > > > > Ismael
> > > > > > > > >
> > > > > > > > > On Fri, Aug 18, 2017 at 9:34 AM, Rajini Sivaram <
> > > > > > > [email protected]
> > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Jun,
> > > > > > > > > >
> > > > > > > > > > It feels more consistent to add errors as yammer metrics
> > > > similar
> > > > > to
> > > > > > > > other
> > > > > > > > > > request metrics. Perhaps we could add some code to track
> > and
> > > > > remove
> > > > > > > > these
> > > > > > > > > > if unused? It is a bit more work, but it would keep the
> > > > externals
> > > > > > > > > > consistent.
> > > > > > > > > >
> > > > > > > > > > Ismael/Manikumar,
> > > > > > > > > >
> > > > > > > > > > Agree that version as a String attribute makes more
> sense.
> > > > > > > > Unfortunately,
> > > > > > > > > > the whole KafkaMetric implementation is written around a
> > > single
> > > > > > > > "double"
> > > > > > > > > > type, so introducing a new type is a big change. But I
> > > suppose
> > > > it
> > > > > > can
> > > > > > > > be
> > > > > > > > > > done. I have updated the KIP.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > >
> > > > > > > > > > Rajini
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, Aug 18, 2017 at 7:42 AM, Manikumar <
> > > > > > > [email protected]>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I agree it will be good if we can add  "commit
> > id/version"
> > > as
> > > > > an
> > > > > > > > > > > attribute value.
> > > > > > > > > > > It will be easy to parse. But as of now, KafkaMetric
> > > supports
> > > > > > only
> > > > > > > > > > > numerical values.
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Aug 18, 2017 at 5:49 AM, Ismael Juma <
> > > > > [email protected]>
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Rajini,
> > > > > > > > > > > >
> > > > > > > > > > > > About the gauges, I was thinking that the attribute
> > would
> > > > be
> > > > > > the
> > > > > > > > > value
> > > > > > > > > > > > (i.e. commit id or version). I understand that Kafka
> > > > Metrics
> > > > > > > > doesn't
> > > > > > > > > > > > support this (unlike Yammer Metrics), but would it
> make
> > > > sense
> > > > > > to
> > > > > > > > add?
> > > > > > > > > > > >
> > > > > > > > > > > > Ismael
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Aug 17, 2017 at 2:54 PM, Rajini Sivaram <
> > > > > > > > > > [email protected]
> > > > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Ismael,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank you for the review.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. Agree on keeping it simple with dynamic
> > registration
> > > > and
> > > > > > no
> > > > > > > > > > expiry.
> > > > > > > > > > > > Will
> > > > > > > > > > > > > wait for Jun's feedback before updating KIP.
> > > > > > > > > > > > > 2. I have switched to two metrics for commit-id and
> > > > version
> > > > > > > (not
> > > > > > > > > sure
> > > > > > > > > > > if
> > > > > > > > > > > > it
> > > > > > > > > > > > > matches what you meant). I also added the client-id
> > tag
> > > > > which
> > > > > > > is
> > > > > > > > > used
> > > > > > > > > > > in
> > > > > > > > > > > > > all metrics from clients.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Regards,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Rajini
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Aug 17, 2017 at 10:55 AM, Ismael Juma <
> > > > > > > [email protected]
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the KIP, Rajini. I think this is
> helpful
> > > > too.
> > > > > A
> > > > > > > few
> > > > > > > > > > minor
> > > > > > > > > > > > > > comments.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. About the number of metrics and expiration, if
> > we
> > > > > > > > dynamically
> > > > > > > > > > > > register
> > > > > > > > > > > > > > metrics for the error codes, the number is likely
> > to
> > > be
> > > > > > much
> > > > > > > > > lower
> > > > > > > > > > > than
> > > > > > > > > > > > > > 30*30, probably less than 100. If we were using
> > Kafka
> > > > > > Metrics
> > > > > > > > for
> > > > > > > > > > > this,
> > > > > > > > > > > > > we
> > > > > > > > > > > > > > could easily add a long expiration period to be
> > > > > > conservative,
> > > > > > > > > but I
> > > > > > > > > > > am
> > > > > > > > > > > > > not
> > > > > > > > > > > > > > sure this is supported by Yammer Metrics. If it
> is
> > > not,
> > > > > > > there's
> > > > > > > > > an
> > > > > > > > > > > > > argument
> > > > > > > > > > > > > > for keeping it simple.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2. Would it make sense to use 2 gauges for the
> > > version
> > > > > and
> > > > > > > > commit
> > > > > > > > > > id?
> > > > > > > > > > > > It
> > > > > > > > > > > > > > seems more intuitive than having those values as
> > > tags.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Ismael
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Aug 17, 2017 at 10:19 AM, Rajini Sivaram
> <
> > > > > > > > > > > > > [email protected]>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Jun,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thank you for the review.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. Makes sense. I have updated the KIP.
> > > > > > > > > > > > > > > 2. Moved to a new group ZooKeeperClient
> > > > > > > > > > > > > > > 3. It is a gauge, so it will have a single
> > > attribute
> > > > > > called
> > > > > > > > > Value
> > > > > > > > > > > > with
> > > > > > > > > > > > > a
> > > > > > > > > > > > > > > constant value of 1.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Rajini
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, Aug 17, 2017 at 3:16 AM, Jun Rao <
> > > > > > [email protected]
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi, Rajini,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the KIP. A few comments.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. We have 30+ requests and 30+ error code
> and
> > > > > growing.
> > > > > > > So,
> > > > > > > > > the
> > > > > > > > > > > > > > > combination
> > > > > > > > > > > > > > > > can be large. Perhaps it's useful to expire
> an
> > > > error
> > > > > > > metric
> > > > > > > > > if
> > > > > > > > > > > it's
> > > > > > > > > > > > > no
> > > > > > > > > > > > > > > > longer updated after some time? We did
> > something
> > > > > > similar
> > > > > > > > for
> > > > > > > > > > the
> > > > > > > > > > > > > quota
> > > > > > > > > > > > > > > > metric.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 2. It's a bit weird to put the ZK latency
> > metric
> > > > > under
> > > > > > > > > > > > > > > > type=SessionExpireListener.
> > > > > > > > > > > > > > > > Perhaps it's more intuitive to put it in a
> > > separate
> > > > > > type.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 3. For the client version metric, since we
> > > > > representing
> > > > > > > > > > commit_id
> > > > > > > > > > > > and
> > > > > > > > > > > > > > > > version as tags in the metric name. So the
> > mbean
> > > > will
> > > > > > > have
> > > > > > > > no
> > > > > > > > > > > > > > attributes?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Jun
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, Aug 16, 2017 at 4:05 PM, Roger
> Hoover <
> > > > > > > > > > > > > [email protected]>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think it would useful to make clear
> > somewhere
> > > > for
> > > > > > > each
> > > > > > > > > > > metric,
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > level
> > > > > > > > > > > > > > > > > at which it's counted.  I don't know all
> the
> > > > > details
> > > > > > of
> > > > > > > > the
> > > > > > > > > > > Kafka
> > > > > > > > > > > > > > > > protocol
> > > > > > > > > > > > > > > > > but it might be something like
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ProduceRequest, Fetch Request - counted at
> > > > > > > per-partition
> > > > > > > > > > level
> > > > > > > > > > > > > > > > > All other requests are 1:1 with client
> > > requests?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Roger
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Wed, Aug 16, 2017 at 4:02 PM, Roger
> > Hoover <
> > > > > > > > > > > > > > [email protected]>
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Rajini,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thank you for the KIP.  These are very
> > > helpful
> > > > > > > > additions.
> > > > > > > > > > > One
> > > > > > > > > > > > > > > question
> > > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > > > the error code metrics:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Will the total error counting happen at
> the
> > > the
> > > > > > level
> > > > > > > > of
> > > > > > > > > > > topic
> > > > > > > > > > > > > > > > partition?
> > > > > > > > > > > > > > > > > > For example, if a single ProduceRequest
> > > > contains
> > > > > > > > messages
> > > > > > > > > > to
> > > > > > > > > > > > > append
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > 3
> > > > > > > > > > > > > > > > > > partitions and say all 3 appends are
> > > > successful,
> > > > > > the
> > > > > > > > > > counter
> > > > > > > > > > > > > > > > > > for kafka.network:type=
> > RequestMetrics,name=
> > > > > > > > > > > > ErrorsPerSec,request=
> > > > > > > > > > > > > > > > > ProduceRequest,error=0
> > > > > > > > > > > > > > > > > > will be incremented by 3?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Roger
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Wed, Aug 16, 2017 at 12:10 PM, Rajini
> > > > Sivaram
> > > > > <
> > > > > > > > > > > > > > > > > [email protected]>
> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >> I have created a KIP to add some
> > additional
> > > > > > metrics
> > > > > > > to
> > > > > > > > > > > support
> > > > > > > > > > > > > > > health
> > > > > > > > > > > > > > > > > >> checks:
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> https://cwiki.apache.org/
> > > > > > > > confluence/display/KAFKA/KIP-
> > > > > > > > > > > 188+-+
> > > > > > > > > > > > > > > > > >> Add+new+metrics+to+support+
> health+checks
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> Feedback and suggestions are welcome.
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> Regards,
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> Rajini
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

Reply via email to