Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-31 Thread Rajini Sivaram
Hi Jun,

Thank you, that makes sense. Updated the KIP.

Regards,

Rajini

On Thu, Aug 31, 2017 at 12:35 PM, Jun Rao  wrote:

> Hi, Rajini,
>
> Thanks for the updated KIP. Should be also track the conversion time for
> the producer? If so, the name can just be MessageConversionTimeMs and only
> the produce and the fetch request may have such a component.
>
> Jun
>
> On Wed, Aug 30, 2017 at 1:37 PM, Rajini Sivaram 
> wrote:
>
> > Jun,
> >
> > Thank you, your suggestions sound good. I have updated the KIP.
> >
> > Regards,
> >
> > Rajini
> >
> > On Tue, Aug 29, 2017 at 9:12 PM, Jun Rao  wrote:
> >
> > > Hi, Rajini,
> > >
> > > Thanks for the updated KIP. I agree that those additional metrics can
> be
> > > useful. I was thinking what would an admin do if the value of one of
> > > those metrics is abnormal. An admin probably want to determine which
> > client
> > > causes the abnormaly. So,  a couple of more comments below.
> > >
> > > 10. About FetchDownConversionsMs. Should we model it at the topic level
> > or
> > > at the request level as a new latency component like
> > messageConversionTime
> > > in addition to the existing localTime, requestQueueTime, etc? The
> benefit
> > > of the latter is that we can include it in the request log and use it
> to
> > > figure out which client is doing the conversion. Also, should we also
> > track
> > > the conversion time in the producer?
> > >
> > > 11. About ProduceBatchSize. Currently, the largest chunk of memory is
> > > allocated at the request level (mostly produce request). So, instead
> > > of ProduceBatchSize
> > > , perhaps we can add a request size metric for each type of request and
> > > also include it in the request log. This way, if there is a memory
> issue,
> > > we can trace it back to a particular client. Similarly, for
> > > ProduceUncompressedBatchSize,
> > > would it be better to track it at the request level as something like
> > > temporaryMemorySize and include it in the request log?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Aug 29, 2017 at 10:07 AM, Roger Hoover  >
> > > wrote:
> > >
> > > > Great suggestions, Ismael and thanks for incorporating them, Rajini.
> > > >
> > > > Tracking authentication success and failures (#3) across listeners
> > seems
> > > > very useful for cluster administrators to identify misconfigured
> client
> > > or
> > > > bad actors, especially until all clients implement KIP-152 which will
> > add
> > > > an explicit error code for authentication failures.  Currently,
> clients
> > > > just get disconnected so it's hard to distinguish authentication
> > failures
> > > > from any other error that can cause disconnect.  This broker-side
> > metric
> > > is
> > > > useful regardless but can help fill this gap until all clients
> support
> > > KIP
> > > > 152.
> > > >
> > > > Just to be clear, the ones called `successful-authentication-rate`
> and
> > > > `failed-authentication-rate` will also have
> failed-authentication-count
> > > > and successful-authentication-count to match KIP 187?
> > > >
> > > > On Tue, Aug 29, 2017 at 7:26 AM, Rajini Sivaram <
> > rajinisiva...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi Ismael,
> > > > >
> > > > > Thank you for the suggestions. The additional metrics sound very
> > useful
> > > > and
> > > > > I have added them to the KIP.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Rajini
> > > > >
> > > > > On Tue, Aug 29, 2017 at 5:34 AM, Ismael Juma 
> > > wrote:
> > > > >
> > > > > > Hi Rajini,
> > > > > >
> > > > > > There are a few other metrics that could potentially be useful.
> I'd
> > > be
> > > > > > interested in what you and the community thinks:
> > > > > >
> > > > > > 1. The KIP currently includes `FetchDownConversionsPerSec`, which
> > is
> > > > > > useful. In the common case, one would want to avoid down
> conversion
> > > by
> > > > > > using the lower message format supported by most of the
> consumers.
> > > > > However,
> > > > > > there are good reasons to use a newer message format even if
> there
> > > are
> > > > > some
> > > > > > legacy consumers around. It would be good to quantify the cost of
> > > these
> > > > > > consumers a bit more clearly. Looking at the request metric
> > > > `LocalTimeMs`
> > > > > > provides a hint, but it may be useful to have a dedicated
> > > > > > `FetchDownConversionsMs` metric.
> > > > > >
> > > > > > 2. Large messages can cause GC issues (it's particularly bad if
> > fetch
> > > > > down
> > > > > > conversion takes place). One can currently configure the max
> > message
> > > > > batch
> > > > > > size per topic to keep this under control, but that is the size
> > after
> > > > > > compression. However, we decompress the batch to validate produce
> > > > > requests
> > > > > > and we decompress and recompress during fetch downconversion). It
> > > would
> > > > > be
> > > > > > helpful to 

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-31 Thread Manikumar
Hi Rajini,

Yes,  It provides the failure rate for all errors. I got confused with old
metric names.
Thank you for the clarification.

Thanks

On Thu, Aug 31, 2017 at 6:34 PM, Rajini Sivaram 
wrote:

> Hi Manikumar,
>
> We currently have topic-level metrics for FailedProduceRequestsPerSec,
> FailedFetchRequestsPerSec. For all requests including produce/fetch, this
> KIP adds a new error rate metric:
>
> MBean:
> kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=api_key_
> name,error=error_code_name
>
> For each request-type, this would give the error rate of every error-code.
> Does this not provide the failure rate you are looking for?
>
> Thank you,
>
> Rajini
>
> On Thu, Aug 31, 2017 at 7:37 AM, Manikumar 
> wrote:
>
> > Hi,
> >
> > Currently, we have FailedProduceRequestsPerSec, FailedFetchRequestsPerSec
> > metrics to indicate
> > un-expected failures on a broker while handling producer/fetch requests.
> > Will it be useful to add these
> > metrics for other requests also? I don't want to include these metrics to
> > this KIP. Just want to check the
> > usefulness of these failed request metrics.
> >
> > Thanks
> >
> > On Thu, Aug 31, 2017 at 2:07 AM, Rajini Sivaram  >
> > wrote:
> >
> > > Jun,
> > >
> > > Thank you, your suggestions sound good. I have updated the KIP.
> > >
> > > Regards,
> > >
> > > Rajini
> > >
> > > On Tue, Aug 29, 2017 at 9:12 PM, Jun Rao  wrote:
> > >
> > > > Hi, Rajini,
> > > >
> > > > Thanks for the updated KIP. I agree that those additional metrics can
> > be
> > > > useful. I was thinking what would an admin do if the value of one of
> > > > those metrics is abnormal. An admin probably want to determine which
> > > client
> > > > causes the abnormaly. So,  a couple of more comments below.
> > > >
> > > > 10. About FetchDownConversionsMs. Should we model it at the topic
> level
> > > or
> > > > at the request level as a new latency component like
> > > messageConversionTime
> > > > in addition to the existing localTime, requestQueueTime, etc? The
> > benefit
> > > > of the latter is that we can include it in the request log and use it
> > to
> > > > figure out which client is doing the conversion. Also, should we also
> > > track
> > > > the conversion time in the producer?
> > > >
> > > > 11. About ProduceBatchSize. Currently, the largest chunk of memory is
> > > > allocated at the request level (mostly produce request). So, instead
> > > > of ProduceBatchSize
> > > > , perhaps we can add a request size metric for each type of request
> and
> > > > also include it in the request log. This way, if there is a memory
> > issue,
> > > > we can trace it back to a particular client. Similarly, for
> > > > ProduceUncompressedBatchSize,
> > > > would it be better to track it at the request level as something like
> > > > temporaryMemorySize and include it in the request log?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Tue, Aug 29, 2017 at 10:07 AM, Roger Hoover <
> roger.hoo...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Great suggestions, Ismael and thanks for incorporating them,
> Rajini.
> > > > >
> > > > > Tracking authentication success and failures (#3) across listeners
> > > seems
> > > > > very useful for cluster administrators to identify misconfigured
> > client
> > > > or
> > > > > bad actors, especially until all clients implement KIP-152 which
> will
> > > add
> > > > > an explicit error code for authentication failures.  Currently,
> > clients
> > > > > just get disconnected so it's hard to distinguish authentication
> > > failures
> > > > > from any other error that can cause disconnect.  This broker-side
> > > metric
> > > > is
> > > > > useful regardless but can help fill this gap until all clients
> > support
> > > > KIP
> > > > > 152.
> > > > >
> > > > > Just to be clear, the ones called `successful-authentication-rate`
> > and
> > > > > `failed-authentication-rate` will also have
> > failed-authentication-count
> > > > > and successful-authentication-count to match KIP 187?
> > > > >
> > > > > On Tue, Aug 29, 2017 at 7:26 AM, Rajini Sivaram <
> > > rajinisiva...@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Ismael,
> > > > > >
> > > > > > Thank you for the suggestions. The additional metrics sound very
> > > useful
> > > > > and
> > > > > > I have added them to the KIP.
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Rajini
> > > > > >
> > > > > > On Tue, Aug 29, 2017 at 5:34 AM, Ismael Juma 
> > > > wrote:
> > > > > >
> > > > > > > Hi Rajini,
> > > > > > >
> > > > > > > There are a few other metrics that could potentially be useful.
> > I'd
> > > > be
> > > > > > > interested in what you and the community thinks:
> > > > > > >
> > > > > > > 1. The KIP currently includes `FetchDownConversionsPerSec`,
> which
> > > is
> > > > > > > useful. In the common case, one would want to avoid down
> > 

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-31 Thread Rajini Sivaram
Hi Roger,

Yes, the rate metrics `successful-authentication-rate` and `
failed-authentication-rate` will also have corresponding
failed-authentication-count and successful-authentication-count to match
KIP 187.

Thank you,

Rajini

On Tue, Aug 29, 2017 at 1:07 PM, Roger Hoover 
wrote:

> Great suggestions, Ismael and thanks for incorporating them, Rajini.
>
> Tracking authentication success and failures (#3) across listeners seems
> very useful for cluster administrators to identify misconfigured client or
> bad actors, especially until all clients implement KIP-152 which will add
> an explicit error code for authentication failures.  Currently, clients
> just get disconnected so it's hard to distinguish authentication failures
> from any other error that can cause disconnect.  This broker-side metric is
> useful regardless but can help fill this gap until all clients support KIP
> 152.
>
> Just to be clear, the ones called `successful-authentication-rate` and
> `failed-authentication-rate` will also have failed-authentication-count
> and successful-authentication-count to match KIP 187?
>
> On Tue, Aug 29, 2017 at 7:26 AM, Rajini Sivaram 
> wrote:
>
> > Hi Ismael,
> >
> > Thank you for the suggestions. The additional metrics sound very useful
> and
> > I have added them to the KIP.
> >
> > Regards,
> >
> > Rajini
> >
> > On Tue, Aug 29, 2017 at 5:34 AM, Ismael Juma  wrote:
> >
> > > Hi Rajini,
> > >
> > > There are a few other metrics that could potentially be useful. I'd be
> > > interested in what you and the community thinks:
> > >
> > > 1. The KIP currently includes `FetchDownConversionsPerSec`, which is
> > > useful. In the common case, one would want to avoid down conversion by
> > > using the lower message format supported by most of the consumers.
> > However,
> > > there are good reasons to use a newer message format even if there are
> > some
> > > legacy consumers around. It would be good to quantify the cost of these
> > > consumers a bit more clearly. Looking at the request metric
> `LocalTimeMs`
> > > provides a hint, but it may be useful to have a dedicated
> > > `FetchDownConversionsMs` metric.
> > >
> > > 2. Large messages can cause GC issues (it's particularly bad if fetch
> > down
> > > conversion takes place). One can currently configure the max message
> > batch
> > > size per topic to keep this under control, but that is the size after
> > > compression. However, we decompress the batch to validate produce
> > requests
> > > and we decompress and recompress during fetch downconversion). It would
> > be
> > > helpful to have topic metrics for the produce message batch size after
> > > decompression (and perhaps compressed as well as that would help
> > understand
> > > the compression ratio).
> > >
> > > 3. Authentication success/failures per second. This is helpful to
> > > understand if some clients are misconfigured or if bad actors are
> trying
> > to
> > > authenticate.
> > >
> > > Thoughts?
> > >
> > > Ismael
> > >
> > >
> > >
> > > On Wed, Aug 23, 2017 at 2:53 AM, Jun Rao  wrote:
> > >
> > > > Hi, Rajini,
> > > >
> > > > Yes, if those error metrics are registered dynamically, we could
> worry
> > > > about expiration later.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Fri, Aug 18, 2017 at 1:55 AM, Rajini Sivaram <
> > rajinisiva...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Perhaps we could register dynamically for now. If we find that the
> > cost
> > > > of
> > > > > retaining these is high, we can add the code to expire them later.
> Is
> > > > that
> > > > > ok?
> > > > >
> > > > > Regards,
> > > > >
> > > > > Rajini
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Aug 18, 2017 at 9:41 AM, Ismael Juma 
> > > wrote:
> > > > >
> > > > > > Can we quantify the cost of having these metrics around if they
> are
> > > > > > dynamically registered? Given that the maximum is bounded at
> > > > development
> > > > > > time, is it really worth all the extra code?
> > > > > >
> > > > > > Ismael
> > > > > >
> > > > > > On Fri, Aug 18, 2017 at 9:34 AM, Rajini Sivaram <
> > > > rajinisiva...@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Jun,
> > > > > > >
> > > > > > > It feels more consistent to add errors as yammer metrics
> similar
> > to
> > > > > other
> > > > > > > request metrics. Perhaps we could add some code to track and
> > remove
> > > > > these
> > > > > > > if unused? It is a bit more work, but it would keep the
> externals
> > > > > > > consistent.
> > > > > > >
> > > > > > > Ismael/Manikumar,
> > > > > > >
> > > > > > > Agree that version as a String attribute makes more sense.
> > > > > Unfortunately,
> > > > > > > the whole KafkaMetric implementation is written around a single
> > > > > "double"
> > > > > > > type, so introducing a new type is a big change. But I suppose
> it
> > > can
> > > > > be
> > > > > > > done. I 

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-31 Thread Rajini Sivaram
Hi Manikumar,

We currently have topic-level metrics for FailedProduceRequestsPerSec,
FailedFetchRequestsPerSec. For all requests including produce/fetch, this
KIP adds a new error rate metric:

MBean:
kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=api_key_name,error=error_code_name

For each request-type, this would give the error rate of every error-code.
Does this not provide the failure rate you are looking for?

Thank you,

Rajini

On Thu, Aug 31, 2017 at 7:37 AM, Manikumar 
wrote:

> Hi,
>
> Currently, we have FailedProduceRequestsPerSec, FailedFetchRequestsPerSec
> metrics to indicate
> un-expected failures on a broker while handling producer/fetch requests.
> Will it be useful to add these
> metrics for other requests also? I don't want to include these metrics to
> this KIP. Just want to check the
> usefulness of these failed request metrics.
>
> Thanks
>
> On Thu, Aug 31, 2017 at 2:07 AM, Rajini Sivaram 
> wrote:
>
> > Jun,
> >
> > Thank you, your suggestions sound good. I have updated the KIP.
> >
> > Regards,
> >
> > Rajini
> >
> > On Tue, Aug 29, 2017 at 9:12 PM, Jun Rao  wrote:
> >
> > > Hi, Rajini,
> > >
> > > Thanks for the updated KIP. I agree that those additional metrics can
> be
> > > useful. I was thinking what would an admin do if the value of one of
> > > those metrics is abnormal. An admin probably want to determine which
> > client
> > > causes the abnormaly. So,  a couple of more comments below.
> > >
> > > 10. About FetchDownConversionsMs. Should we model it at the topic level
> > or
> > > at the request level as a new latency component like
> > messageConversionTime
> > > in addition to the existing localTime, requestQueueTime, etc? The
> benefit
> > > of the latter is that we can include it in the request log and use it
> to
> > > figure out which client is doing the conversion. Also, should we also
> > track
> > > the conversion time in the producer?
> > >
> > > 11. About ProduceBatchSize. Currently, the largest chunk of memory is
> > > allocated at the request level (mostly produce request). So, instead
> > > of ProduceBatchSize
> > > , perhaps we can add a request size metric for each type of request and
> > > also include it in the request log. This way, if there is a memory
> issue,
> > > we can trace it back to a particular client. Similarly, for
> > > ProduceUncompressedBatchSize,
> > > would it be better to track it at the request level as something like
> > > temporaryMemorySize and include it in the request log?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Aug 29, 2017 at 10:07 AM, Roger Hoover  >
> > > wrote:
> > >
> > > > Great suggestions, Ismael and thanks for incorporating them, Rajini.
> > > >
> > > > Tracking authentication success and failures (#3) across listeners
> > seems
> > > > very useful for cluster administrators to identify misconfigured
> client
> > > or
> > > > bad actors, especially until all clients implement KIP-152 which will
> > add
> > > > an explicit error code for authentication failures.  Currently,
> clients
> > > > just get disconnected so it's hard to distinguish authentication
> > failures
> > > > from any other error that can cause disconnect.  This broker-side
> > metric
> > > is
> > > > useful regardless but can help fill this gap until all clients
> support
> > > KIP
> > > > 152.
> > > >
> > > > Just to be clear, the ones called `successful-authentication-rate`
> and
> > > > `failed-authentication-rate` will also have
> failed-authentication-count
> > > > and successful-authentication-count to match KIP 187?
> > > >
> > > > On Tue, Aug 29, 2017 at 7:26 AM, Rajini Sivaram <
> > rajinisiva...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi Ismael,
> > > > >
> > > > > Thank you for the suggestions. The additional metrics sound very
> > useful
> > > > and
> > > > > I have added them to the KIP.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Rajini
> > > > >
> > > > > On Tue, Aug 29, 2017 at 5:34 AM, Ismael Juma 
> > > wrote:
> > > > >
> > > > > > Hi Rajini,
> > > > > >
> > > > > > There are a few other metrics that could potentially be useful.
> I'd
> > > be
> > > > > > interested in what you and the community thinks:
> > > > > >
> > > > > > 1. The KIP currently includes `FetchDownConversionsPerSec`, which
> > is
> > > > > > useful. In the common case, one would want to avoid down
> conversion
> > > by
> > > > > > using the lower message format supported by most of the
> consumers.
> > > > > However,
> > > > > > there are good reasons to use a newer message format even if
> there
> > > are
> > > > > some
> > > > > > legacy consumers around. It would be good to quantify the cost of
> > > these
> > > > > > consumers a bit more clearly. Looking at the request metric
> > > > `LocalTimeMs`
> > > > > > provides a hint, but it may be useful to have a dedicated
> > > > > > `FetchDownConversionsMs` 

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-31 Thread Manikumar
Hi,

Currently, we have FailedProduceRequestsPerSec, FailedFetchRequestsPerSec
metrics to indicate
un-expected failures on a broker while handling producer/fetch requests.
Will it be useful to add these
metrics for other requests also? I don't want to include these metrics to
this KIP. Just want to check the
usefulness of these failed request metrics.

Thanks

On Thu, Aug 31, 2017 at 2:07 AM, Rajini Sivaram 
wrote:

> Jun,
>
> Thank you, your suggestions sound good. I have updated the KIP.
>
> Regards,
>
> Rajini
>
> On Tue, Aug 29, 2017 at 9:12 PM, Jun Rao  wrote:
>
> > Hi, Rajini,
> >
> > Thanks for the updated KIP. I agree that those additional metrics can be
> > useful. I was thinking what would an admin do if the value of one of
> > those metrics is abnormal. An admin probably want to determine which
> client
> > causes the abnormaly. So,  a couple of more comments below.
> >
> > 10. About FetchDownConversionsMs. Should we model it at the topic level
> or
> > at the request level as a new latency component like
> messageConversionTime
> > in addition to the existing localTime, requestQueueTime, etc? The benefit
> > of the latter is that we can include it in the request log and use it to
> > figure out which client is doing the conversion. Also, should we also
> track
> > the conversion time in the producer?
> >
> > 11. About ProduceBatchSize. Currently, the largest chunk of memory is
> > allocated at the request level (mostly produce request). So, instead
> > of ProduceBatchSize
> > , perhaps we can add a request size metric for each type of request and
> > also include it in the request log. This way, if there is a memory issue,
> > we can trace it back to a particular client. Similarly, for
> > ProduceUncompressedBatchSize,
> > would it be better to track it at the request level as something like
> > temporaryMemorySize and include it in the request log?
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Aug 29, 2017 at 10:07 AM, Roger Hoover 
> > wrote:
> >
> > > Great suggestions, Ismael and thanks for incorporating them, Rajini.
> > >
> > > Tracking authentication success and failures (#3) across listeners
> seems
> > > very useful for cluster administrators to identify misconfigured client
> > or
> > > bad actors, especially until all clients implement KIP-152 which will
> add
> > > an explicit error code for authentication failures.  Currently, clients
> > > just get disconnected so it's hard to distinguish authentication
> failures
> > > from any other error that can cause disconnect.  This broker-side
> metric
> > is
> > > useful regardless but can help fill this gap until all clients support
> > KIP
> > > 152.
> > >
> > > Just to be clear, the ones called `successful-authentication-rate` and
> > > `failed-authentication-rate` will also have failed-authentication-count
> > > and successful-authentication-count to match KIP 187?
> > >
> > > On Tue, Aug 29, 2017 at 7:26 AM, Rajini Sivaram <
> rajinisiva...@gmail.com
> > >
> > > wrote:
> > >
> > > > Hi Ismael,
> > > >
> > > > Thank you for the suggestions. The additional metrics sound very
> useful
> > > and
> > > > I have added them to the KIP.
> > > >
> > > > Regards,
> > > >
> > > > Rajini
> > > >
> > > > On Tue, Aug 29, 2017 at 5:34 AM, Ismael Juma 
> > wrote:
> > > >
> > > > > Hi Rajini,
> > > > >
> > > > > There are a few other metrics that could potentially be useful. I'd
> > be
> > > > > interested in what you and the community thinks:
> > > > >
> > > > > 1. The KIP currently includes `FetchDownConversionsPerSec`, which
> is
> > > > > useful. In the common case, one would want to avoid down conversion
> > by
> > > > > using the lower message format supported by most of the consumers.
> > > > However,
> > > > > there are good reasons to use a newer message format even if there
> > are
> > > > some
> > > > > legacy consumers around. It would be good to quantify the cost of
> > these
> > > > > consumers a bit more clearly. Looking at the request metric
> > > `LocalTimeMs`
> > > > > provides a hint, but it may be useful to have a dedicated
> > > > > `FetchDownConversionsMs` metric.
> > > > >
> > > > > 2. Large messages can cause GC issues (it's particularly bad if
> fetch
> > > > down
> > > > > conversion takes place). One can currently configure the max
> message
> > > > batch
> > > > > size per topic to keep this under control, but that is the size
> after
> > > > > compression. However, we decompress the batch to validate produce
> > > > requests
> > > > > and we decompress and recompress during fetch downconversion). It
> > would
> > > > be
> > > > > helpful to have topic metrics for the produce message batch size
> > after
> > > > > decompression (and perhaps compressed as well as that would help
> > > > understand
> > > > > the compression ratio).
> > > > >
> > > > > 3. Authentication success/failures per second. This is helpful to
> > > > > understand if 

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-30 Thread Rajini Sivaram
Jun,

Thank you, your suggestions sound good. I have updated the KIP.

Regards,

Rajini

On Tue, Aug 29, 2017 at 9:12 PM, Jun Rao  wrote:

> Hi, Rajini,
>
> Thanks for the updated KIP. I agree that those additional metrics can be
> useful. I was thinking what would an admin do if the value of one of
> those metrics is abnormal. An admin probably want to determine which client
> causes the abnormaly. So,  a couple of more comments below.
>
> 10. About FetchDownConversionsMs. Should we model it at the topic level or
> at the request level as a new latency component like messageConversionTime
> in addition to the existing localTime, requestQueueTime, etc? The benefit
> of the latter is that we can include it in the request log and use it to
> figure out which client is doing the conversion. Also, should we also track
> the conversion time in the producer?
>
> 11. About ProduceBatchSize. Currently, the largest chunk of memory is
> allocated at the request level (mostly produce request). So, instead
> of ProduceBatchSize
> , perhaps we can add a request size metric for each type of request and
> also include it in the request log. This way, if there is a memory issue,
> we can trace it back to a particular client. Similarly, for
> ProduceUncompressedBatchSize,
> would it be better to track it at the request level as something like
> temporaryMemorySize and include it in the request log?
>
> Thanks,
>
> Jun
>
> On Tue, Aug 29, 2017 at 10:07 AM, Roger Hoover 
> wrote:
>
> > Great suggestions, Ismael and thanks for incorporating them, Rajini.
> >
> > Tracking authentication success and failures (#3) across listeners seems
> > very useful for cluster administrators to identify misconfigured client
> or
> > bad actors, especially until all clients implement KIP-152 which will add
> > an explicit error code for authentication failures.  Currently, clients
> > just get disconnected so it's hard to distinguish authentication failures
> > from any other error that can cause disconnect.  This broker-side metric
> is
> > useful regardless but can help fill this gap until all clients support
> KIP
> > 152.
> >
> > Just to be clear, the ones called `successful-authentication-rate` and
> > `failed-authentication-rate` will also have failed-authentication-count
> > and successful-authentication-count to match KIP 187?
> >
> > On Tue, Aug 29, 2017 at 7:26 AM, Rajini Sivaram  >
> > wrote:
> >
> > > Hi Ismael,
> > >
> > > Thank you for the suggestions. The additional metrics sound very useful
> > and
> > > I have added them to the KIP.
> > >
> > > Regards,
> > >
> > > Rajini
> > >
> > > On Tue, Aug 29, 2017 at 5:34 AM, Ismael Juma 
> wrote:
> > >
> > > > Hi Rajini,
> > > >
> > > > There are a few other metrics that could potentially be useful. I'd
> be
> > > > interested in what you and the community thinks:
> > > >
> > > > 1. The KIP currently includes `FetchDownConversionsPerSec`, which is
> > > > useful. In the common case, one would want to avoid down conversion
> by
> > > > using the lower message format supported by most of the consumers.
> > > However,
> > > > there are good reasons to use a newer message format even if there
> are
> > > some
> > > > legacy consumers around. It would be good to quantify the cost of
> these
> > > > consumers a bit more clearly. Looking at the request metric
> > `LocalTimeMs`
> > > > provides a hint, but it may be useful to have a dedicated
> > > > `FetchDownConversionsMs` metric.
> > > >
> > > > 2. Large messages can cause GC issues (it's particularly bad if fetch
> > > down
> > > > conversion takes place). One can currently configure the max message
> > > batch
> > > > size per topic to keep this under control, but that is the size after
> > > > compression. However, we decompress the batch to validate produce
> > > requests
> > > > and we decompress and recompress during fetch downconversion). It
> would
> > > be
> > > > helpful to have topic metrics for the produce message batch size
> after
> > > > decompression (and perhaps compressed as well as that would help
> > > understand
> > > > the compression ratio).
> > > >
> > > > 3. Authentication success/failures per second. This is helpful to
> > > > understand if some clients are misconfigured or if bad actors are
> > trying
> > > to
> > > > authenticate.
> > > >
> > > > Thoughts?
> > > >
> > > > Ismael
> > > >
> > > >
> > > >
> > > > On Wed, Aug 23, 2017 at 2:53 AM, Jun Rao  wrote:
> > > >
> > > > > Hi, Rajini,
> > > > >
> > > > > Yes, if those error metrics are registered dynamically, we could
> > worry
> > > > > about expiration later.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Fri, Aug 18, 2017 at 1:55 AM, Rajini Sivaram <
> > > rajinisiva...@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Perhaps we could register dynamically for now. If we find that
> the
> > > cost
> > > > > of
> > 

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-29 Thread Jun Rao
Hi, Rajini,

Thanks for the updated KIP. I agree that those additional metrics can be
useful. I was thinking what would an admin do if the value of one of
those metrics is abnormal. An admin probably want to determine which client
causes the abnormaly. So,  a couple of more comments below.

10. About FetchDownConversionsMs. Should we model it at the topic level or
at the request level as a new latency component like messageConversionTime
in addition to the existing localTime, requestQueueTime, etc? The benefit
of the latter is that we can include it in the request log and use it to
figure out which client is doing the conversion. Also, should we also track
the conversion time in the producer?

11. About ProduceBatchSize. Currently, the largest chunk of memory is
allocated at the request level (mostly produce request). So, instead
of ProduceBatchSize
, perhaps we can add a request size metric for each type of request and
also include it in the request log. This way, if there is a memory issue,
we can trace it back to a particular client. Similarly, for
ProduceUncompressedBatchSize,
would it be better to track it at the request level as something like
temporaryMemorySize and include it in the request log?

Thanks,

Jun

On Tue, Aug 29, 2017 at 10:07 AM, Roger Hoover 
wrote:

> Great suggestions, Ismael and thanks for incorporating them, Rajini.
>
> Tracking authentication success and failures (#3) across listeners seems
> very useful for cluster administrators to identify misconfigured client or
> bad actors, especially until all clients implement KIP-152 which will add
> an explicit error code for authentication failures.  Currently, clients
> just get disconnected so it's hard to distinguish authentication failures
> from any other error that can cause disconnect.  This broker-side metric is
> useful regardless but can help fill this gap until all clients support KIP
> 152.
>
> Just to be clear, the ones called `successful-authentication-rate` and
> `failed-authentication-rate` will also have failed-authentication-count
> and successful-authentication-count to match KIP 187?
>
> On Tue, Aug 29, 2017 at 7:26 AM, Rajini Sivaram 
> wrote:
>
> > Hi Ismael,
> >
> > Thank you for the suggestions. The additional metrics sound very useful
> and
> > I have added them to the KIP.
> >
> > Regards,
> >
> > Rajini
> >
> > On Tue, Aug 29, 2017 at 5:34 AM, Ismael Juma  wrote:
> >
> > > Hi Rajini,
> > >
> > > There are a few other metrics that could potentially be useful. I'd be
> > > interested in what you and the community thinks:
> > >
> > > 1. The KIP currently includes `FetchDownConversionsPerSec`, which is
> > > useful. In the common case, one would want to avoid down conversion by
> > > using the lower message format supported by most of the consumers.
> > However,
> > > there are good reasons to use a newer message format even if there are
> > some
> > > legacy consumers around. It would be good to quantify the cost of these
> > > consumers a bit more clearly. Looking at the request metric
> `LocalTimeMs`
> > > provides a hint, but it may be useful to have a dedicated
> > > `FetchDownConversionsMs` metric.
> > >
> > > 2. Large messages can cause GC issues (it's particularly bad if fetch
> > down
> > > conversion takes place). One can currently configure the max message
> > batch
> > > size per topic to keep this under control, but that is the size after
> > > compression. However, we decompress the batch to validate produce
> > requests
> > > and we decompress and recompress during fetch downconversion). It would
> > be
> > > helpful to have topic metrics for the produce message batch size after
> > > decompression (and perhaps compressed as well as that would help
> > understand
> > > the compression ratio).
> > >
> > > 3. Authentication success/failures per second. This is helpful to
> > > understand if some clients are misconfigured or if bad actors are
> trying
> > to
> > > authenticate.
> > >
> > > Thoughts?
> > >
> > > Ismael
> > >
> > >
> > >
> > > On Wed, Aug 23, 2017 at 2:53 AM, Jun Rao  wrote:
> > >
> > > > Hi, Rajini,
> > > >
> > > > Yes, if those error metrics are registered dynamically, we could
> worry
> > > > about expiration later.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Fri, Aug 18, 2017 at 1:55 AM, Rajini Sivaram <
> > rajinisiva...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Perhaps we could register dynamically for now. If we find that the
> > cost
> > > > of
> > > > > retaining these is high, we can add the code to expire them later.
> Is
> > > > that
> > > > > ok?
> > > > >
> > > > > Regards,
> > > > >
> > > > > Rajini
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Aug 18, 2017 at 9:41 AM, Ismael Juma 
> > > wrote:
> > > > >
> > > > > > Can we quantify the cost of having these metrics around if they
> are
> > > > > > dynamically registered? Given that the 

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-29 Thread Roger Hoover
Great suggestions, Ismael and thanks for incorporating them, Rajini.

Tracking authentication success and failures (#3) across listeners seems
very useful for cluster administrators to identify misconfigured client or
bad actors, especially until all clients implement KIP-152 which will add
an explicit error code for authentication failures.  Currently, clients
just get disconnected so it's hard to distinguish authentication failures
from any other error that can cause disconnect.  This broker-side metric is
useful regardless but can help fill this gap until all clients support KIP
152.

Just to be clear, the ones called `successful-authentication-rate` and
`failed-authentication-rate` will also have failed-authentication-count
and successful-authentication-count to match KIP 187?

On Tue, Aug 29, 2017 at 7:26 AM, Rajini Sivaram 
wrote:

> Hi Ismael,
>
> Thank you for the suggestions. The additional metrics sound very useful and
> I have added them to the KIP.
>
> Regards,
>
> Rajini
>
> On Tue, Aug 29, 2017 at 5:34 AM, Ismael Juma  wrote:
>
> > Hi Rajini,
> >
> > There are a few other metrics that could potentially be useful. I'd be
> > interested in what you and the community thinks:
> >
> > 1. The KIP currently includes `FetchDownConversionsPerSec`, which is
> > useful. In the common case, one would want to avoid down conversion by
> > using the lower message format supported by most of the consumers.
> However,
> > there are good reasons to use a newer message format even if there are
> some
> > legacy consumers around. It would be good to quantify the cost of these
> > consumers a bit more clearly. Looking at the request metric `LocalTimeMs`
> > provides a hint, but it may be useful to have a dedicated
> > `FetchDownConversionsMs` metric.
> >
> > 2. Large messages can cause GC issues (it's particularly bad if fetch
> down
> > conversion takes place). One can currently configure the max message
> batch
> > size per topic to keep this under control, but that is the size after
> > compression. However, we decompress the batch to validate produce
> requests
> > and we decompress and recompress during fetch downconversion). It would
> be
> > helpful to have topic metrics for the produce message batch size after
> > decompression (and perhaps compressed as well as that would help
> understand
> > the compression ratio).
> >
> > 3. Authentication success/failures per second. This is helpful to
> > understand if some clients are misconfigured or if bad actors are trying
> to
> > authenticate.
> >
> > Thoughts?
> >
> > Ismael
> >
> >
> >
> > On Wed, Aug 23, 2017 at 2:53 AM, Jun Rao  wrote:
> >
> > > Hi, Rajini,
> > >
> > > Yes, if those error metrics are registered dynamically, we could worry
> > > about expiration later.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Fri, Aug 18, 2017 at 1:55 AM, Rajini Sivaram <
> rajinisiva...@gmail.com
> > >
> > > wrote:
> > >
> > > > Perhaps we could register dynamically for now. If we find that the
> cost
> > > of
> > > > retaining these is high, we can add the code to expire them later. Is
> > > that
> > > > ok?
> > > >
> > > > Regards,
> > > >
> > > > Rajini
> > > >
> > > >
> > > >
> > > > On Fri, Aug 18, 2017 at 9:41 AM, Ismael Juma 
> > wrote:
> > > >
> > > > > Can we quantify the cost of having these metrics around if they are
> > > > > dynamically registered? Given that the maximum is bounded at
> > > development
> > > > > time, is it really worth all the extra code?
> > > > >
> > > > > Ismael
> > > > >
> > > > > On Fri, Aug 18, 2017 at 9:34 AM, Rajini Sivaram <
> > > rajinisiva...@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Jun,
> > > > > >
> > > > > > It feels more consistent to add errors as yammer metrics similar
> to
> > > > other
> > > > > > request metrics. Perhaps we could add some code to track and
> remove
> > > > these
> > > > > > if unused? It is a bit more work, but it would keep the externals
> > > > > > consistent.
> > > > > >
> > > > > > Ismael/Manikumar,
> > > > > >
> > > > > > Agree that version as a String attribute makes more sense.
> > > > Unfortunately,
> > > > > > the whole KafkaMetric implementation is written around a single
> > > > "double"
> > > > > > type, so introducing a new type is a big change. But I suppose it
> > can
> > > > be
> > > > > > done. I have updated the KIP.
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Rajini
> > > > > >
> > > > > >
> > > > > > On Fri, Aug 18, 2017 at 7:42 AM, Manikumar <
> > > manikumar.re...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > I agree it will be good if we can add  "commit id/version" as
> an
> > > > > > > attribute value.
> > > > > > > It will be easy to parse. But as of now, KafkaMetric supports
> > only
> > > > > > > numerical values.
> > > > > > >
> > > > > > > On Fri, Aug 18, 2017 at 5:49 AM, Ismael Juma <
> ism...@juma.me.uk>
> > > > > wrote:
> > > > > > >
> > > > 

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-29 Thread Rajini Sivaram
Hi Ismael,

Thank you for the suggestions. The additional metrics sound very useful and
I have added them to the KIP.

Regards,

Rajini

On Tue, Aug 29, 2017 at 5:34 AM, Ismael Juma  wrote:

> Hi Rajini,
>
> There are a few other metrics that could potentially be useful. I'd be
> interested in what you and the community thinks:
>
> 1. The KIP currently includes `FetchDownConversionsPerSec`, which is
> useful. In the common case, one would want to avoid down conversion by
> using the lower message format supported by most of the consumers. However,
> there are good reasons to use a newer message format even if there are some
> legacy consumers around. It would be good to quantify the cost of these
> consumers a bit more clearly. Looking at the request metric `LocalTimeMs`
> provides a hint, but it may be useful to have a dedicated
> `FetchDownConversionsMs` metric.
>
> 2. Large messages can cause GC issues (it's particularly bad if fetch down
> conversion takes place). One can currently configure the max message batch
> size per topic to keep this under control, but that is the size after
> compression. However, we decompress the batch to validate produce requests
> and we decompress and recompress during fetch downconversion). It would be
> helpful to have topic metrics for the produce message batch size after
> decompression (and perhaps compressed as well as that would help understand
> the compression ratio).
>
> 3. Authentication success/failures per second. This is helpful to
> understand if some clients are misconfigured or if bad actors are trying to
> authenticate.
>
> Thoughts?
>
> Ismael
>
>
>
> On Wed, Aug 23, 2017 at 2:53 AM, Jun Rao  wrote:
>
> > Hi, Rajini,
> >
> > Yes, if those error metrics are registered dynamically, we could worry
> > about expiration later.
> >
> > Thanks,
> >
> > Jun
> >
> > On Fri, Aug 18, 2017 at 1:55 AM, Rajini Sivaram  >
> > wrote:
> >
> > > Perhaps we could register dynamically for now. If we find that the cost
> > of
> > > retaining these is high, we can add the code to expire them later. Is
> > that
> > > ok?
> > >
> > > Regards,
> > >
> > > Rajini
> > >
> > >
> > >
> > > On Fri, Aug 18, 2017 at 9:41 AM, Ismael Juma 
> wrote:
> > >
> > > > Can we quantify the cost of having these metrics around if they are
> > > > dynamically registered? Given that the maximum is bounded at
> > development
> > > > time, is it really worth all the extra code?
> > > >
> > > > Ismael
> > > >
> > > > On Fri, Aug 18, 2017 at 9:34 AM, Rajini Sivaram <
> > rajinisiva...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Jun,
> > > > >
> > > > > It feels more consistent to add errors as yammer metrics similar to
> > > other
> > > > > request metrics. Perhaps we could add some code to track and remove
> > > these
> > > > > if unused? It is a bit more work, but it would keep the externals
> > > > > consistent.
> > > > >
> > > > > Ismael/Manikumar,
> > > > >
> > > > > Agree that version as a String attribute makes more sense.
> > > Unfortunately,
> > > > > the whole KafkaMetric implementation is written around a single
> > > "double"
> > > > > type, so introducing a new type is a big change. But I suppose it
> can
> > > be
> > > > > done. I have updated the KIP.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Rajini
> > > > >
> > > > >
> > > > > On Fri, Aug 18, 2017 at 7:42 AM, Manikumar <
> > manikumar.re...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > I agree it will be good if we can add  "commit id/version" as an
> > > > > > attribute value.
> > > > > > It will be easy to parse. But as of now, KafkaMetric supports
> only
> > > > > > numerical values.
> > > > > >
> > > > > > On Fri, Aug 18, 2017 at 5:49 AM, Ismael Juma 
> > > > wrote:
> > > > > >
> > > > > > > Hi Rajini,
> > > > > > >
> > > > > > > About the gauges, I was thinking that the attribute would be
> the
> > > > value
> > > > > > > (i.e. commit id or version). I understand that Kafka Metrics
> > > doesn't
> > > > > > > support this (unlike Yammer Metrics), but would it make sense
> to
> > > add?
> > > > > > >
> > > > > > > Ismael
> > > > > > >
> > > > > > > On Thu, Aug 17, 2017 at 2:54 PM, Rajini Sivaram <
> > > > > rajinisiva...@gmail.com
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Ismael,
> > > > > > > >
> > > > > > > > Thank you for the review.
> > > > > > > >
> > > > > > > > 1. Agree on keeping it simple with dynamic registration and
> no
> > > > > expiry.
> > > > > > > Will
> > > > > > > > wait for Jun's feedback before updating KIP.
> > > > > > > > 2. I have switched to two metrics for commit-id and version
> > (not
> > > > sure
> > > > > > if
> > > > > > > it
> > > > > > > > matches what you meant). I also added the client-id tag which
> > is
> > > > used
> > > > > > in
> > > > > > > > all metrics from clients.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Rajini

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-29 Thread Ismael Juma
Hi Rajini,

There are a few other metrics that could potentially be useful. I'd be
interested in what you and the community thinks:

1. The KIP currently includes `FetchDownConversionsPerSec`, which is
useful. In the common case, one would want to avoid down conversion by
using the lower message format supported by most of the consumers. However,
there are good reasons to use a newer message format even if there are some
legacy consumers around. It would be good to quantify the cost of these
consumers a bit more clearly. Looking at the request metric `LocalTimeMs`
provides a hint, but it may be useful to have a dedicated
`FetchDownConversionsMs` metric.

2. Large messages can cause GC issues (it's particularly bad if fetch down
conversion takes place). One can currently configure the max message batch
size per topic to keep this under control, but that is the size after
compression. However, we decompress the batch to validate produce requests
and we decompress and recompress during fetch downconversion). It would be
helpful to have topic metrics for the produce message batch size after
decompression (and perhaps compressed as well as that would help understand
the compression ratio).

3. Authentication success/failures per second. This is helpful to
understand if some clients are misconfigured or if bad actors are trying to
authenticate.

Thoughts?

Ismael



On Wed, Aug 23, 2017 at 2:53 AM, Jun Rao  wrote:

> Hi, Rajini,
>
> Yes, if those error metrics are registered dynamically, we could worry
> about expiration later.
>
> Thanks,
>
> Jun
>
> On Fri, Aug 18, 2017 at 1:55 AM, Rajini Sivaram 
> wrote:
>
> > Perhaps we could register dynamically for now. If we find that the cost
> of
> > retaining these is high, we can add the code to expire them later. Is
> that
> > ok?
> >
> > Regards,
> >
> > Rajini
> >
> >
> >
> > On Fri, Aug 18, 2017 at 9:41 AM, Ismael Juma  wrote:
> >
> > > Can we quantify the cost of having these metrics around if they are
> > > dynamically registered? Given that the maximum is bounded at
> development
> > > time, is it really worth all the extra code?
> > >
> > > Ismael
> > >
> > > On Fri, Aug 18, 2017 at 9:34 AM, Rajini Sivaram <
> rajinisiva...@gmail.com
> > >
> > > wrote:
> > >
> > > > Jun,
> > > >
> > > > It feels more consistent to add errors as yammer metrics similar to
> > other
> > > > request metrics. Perhaps we could add some code to track and remove
> > these
> > > > if unused? It is a bit more work, but it would keep the externals
> > > > consistent.
> > > >
> > > > Ismael/Manikumar,
> > > >
> > > > Agree that version as a String attribute makes more sense.
> > Unfortunately,
> > > > the whole KafkaMetric implementation is written around a single
> > "double"
> > > > type, so introducing a new type is a big change. But I suppose it can
> > be
> > > > done. I have updated the KIP.
> > > >
> > > > Regards,
> > > >
> > > > Rajini
> > > >
> > > >
> > > > On Fri, Aug 18, 2017 at 7:42 AM, Manikumar <
> manikumar.re...@gmail.com>
> > > > wrote:
> > > >
> > > > > I agree it will be good if we can add  "commit id/version" as an
> > > > > attribute value.
> > > > > It will be easy to parse. But as of now, KafkaMetric supports only
> > > > > numerical values.
> > > > >
> > > > > On Fri, Aug 18, 2017 at 5:49 AM, Ismael Juma 
> > > wrote:
> > > > >
> > > > > > Hi Rajini,
> > > > > >
> > > > > > About the gauges, I was thinking that the attribute would be the
> > > value
> > > > > > (i.e. commit id or version). I understand that Kafka Metrics
> > doesn't
> > > > > > support this (unlike Yammer Metrics), but would it make sense to
> > add?
> > > > > >
> > > > > > Ismael
> > > > > >
> > > > > > On Thu, Aug 17, 2017 at 2:54 PM, Rajini Sivaram <
> > > > rajinisiva...@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Ismael,
> > > > > > >
> > > > > > > Thank you for the review.
> > > > > > >
> > > > > > > 1. Agree on keeping it simple with dynamic registration and no
> > > > expiry.
> > > > > > Will
> > > > > > > wait for Jun's feedback before updating KIP.
> > > > > > > 2. I have switched to two metrics for commit-id and version
> (not
> > > sure
> > > > > if
> > > > > > it
> > > > > > > matches what you meant). I also added the client-id tag which
> is
> > > used
> > > > > in
> > > > > > > all metrics from clients.
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Rajini
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Aug 17, 2017 at 10:55 AM, Ismael Juma <
> ism...@juma.me.uk
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Thanks for the KIP, Rajini. I think this is helpful too. A
> few
> > > > minor
> > > > > > > > comments.
> > > > > > > >
> > > > > > > > 1. About the number of metrics and expiration, if we
> > dynamically
> > > > > > register
> > > > > > > > metrics for the error codes, the number is likely to be much
> > > lower
> > > > > than
> > > > > > > > 

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-22 Thread Jun Rao
Hi, Rajini,

Yes, if those error metrics are registered dynamically, we could worry
about expiration later.

Thanks,

Jun

On Fri, Aug 18, 2017 at 1:55 AM, Rajini Sivaram 
wrote:

> Perhaps we could register dynamically for now. If we find that the cost of
> retaining these is high, we can add the code to expire them later. Is that
> ok?
>
> Regards,
>
> Rajini
>
>
>
> On Fri, Aug 18, 2017 at 9:41 AM, Ismael Juma  wrote:
>
> > Can we quantify the cost of having these metrics around if they are
> > dynamically registered? Given that the maximum is bounded at development
> > time, is it really worth all the extra code?
> >
> > Ismael
> >
> > On Fri, Aug 18, 2017 at 9:34 AM, Rajini Sivaram  >
> > wrote:
> >
> > > Jun,
> > >
> > > It feels more consistent to add errors as yammer metrics similar to
> other
> > > request metrics. Perhaps we could add some code to track and remove
> these
> > > if unused? It is a bit more work, but it would keep the externals
> > > consistent.
> > >
> > > Ismael/Manikumar,
> > >
> > > Agree that version as a String attribute makes more sense.
> Unfortunately,
> > > the whole KafkaMetric implementation is written around a single
> "double"
> > > type, so introducing a new type is a big change. But I suppose it can
> be
> > > done. I have updated the KIP.
> > >
> > > Regards,
> > >
> > > Rajini
> > >
> > >
> > > On Fri, Aug 18, 2017 at 7:42 AM, Manikumar 
> > > wrote:
> > >
> > > > I agree it will be good if we can add  "commit id/version" as an
> > > > attribute value.
> > > > It will be easy to parse. But as of now, KafkaMetric supports only
> > > > numerical values.
> > > >
> > > > On Fri, Aug 18, 2017 at 5:49 AM, Ismael Juma 
> > wrote:
> > > >
> > > > > Hi Rajini,
> > > > >
> > > > > About the gauges, I was thinking that the attribute would be the
> > value
> > > > > (i.e. commit id or version). I understand that Kafka Metrics
> doesn't
> > > > > support this (unlike Yammer Metrics), but would it make sense to
> add?
> > > > >
> > > > > Ismael
> > > > >
> > > > > On Thu, Aug 17, 2017 at 2:54 PM, Rajini Sivaram <
> > > rajinisiva...@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Ismael,
> > > > > >
> > > > > > Thank you for the review.
> > > > > >
> > > > > > 1. Agree on keeping it simple with dynamic registration and no
> > > expiry.
> > > > > Will
> > > > > > wait for Jun's feedback before updating KIP.
> > > > > > 2. I have switched to two metrics for commit-id and version (not
> > sure
> > > > if
> > > > > it
> > > > > > matches what you meant). I also added the client-id tag which is
> > used
> > > > in
> > > > > > all metrics from clients.
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Rajini
> > > > > >
> > > > > >
> > > > > > On Thu, Aug 17, 2017 at 10:55 AM, Ismael Juma  >
> > > > wrote:
> > > > > >
> > > > > > > Thanks for the KIP, Rajini. I think this is helpful too. A few
> > > minor
> > > > > > > comments.
> > > > > > >
> > > > > > > 1. About the number of metrics and expiration, if we
> dynamically
> > > > > register
> > > > > > > metrics for the error codes, the number is likely to be much
> > lower
> > > > than
> > > > > > > 30*30, probably less than 100. If we were using Kafka Metrics
> for
> > > > this,
> > > > > > we
> > > > > > > could easily add a long expiration period to be conservative,
> > but I
> > > > am
> > > > > > not
> > > > > > > sure this is supported by Yammer Metrics. If it is not, there's
> > an
> > > > > > argument
> > > > > > > for keeping it simple.
> > > > > > >
> > > > > > > 2. Would it make sense to use 2 gauges for the version and
> commit
> > > id?
> > > > > It
> > > > > > > seems more intuitive than having those values as tags.
> > > > > > >
> > > > > > > Ismael
> > > > > > >
> > > > > > > On Thu, Aug 17, 2017 at 10:19 AM, Rajini Sivaram <
> > > > > > rajinisiva...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Jun,
> > > > > > > >
> > > > > > > > Thank you for the review.
> > > > > > > >
> > > > > > > > 1. Makes sense. I have updated the KIP.
> > > > > > > > 2. Moved to a new group ZooKeeperClient
> > > > > > > > 3. It is a gauge, so it will have a single attribute called
> > Value
> > > > > with
> > > > > > a
> > > > > > > > constant value of 1.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Rajini
> > > > > > > >
> > > > > > > > On Thu, Aug 17, 2017 at 3:16 AM, Jun Rao 
> > > wrote:
> > > > > > > >
> > > > > > > > > Hi, Rajini,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP. A few comments.
> > > > > > > > >
> > > > > > > > > 1. We have 30+ requests and 30+ error code and growing. So,
> > the
> > > > > > > > combination
> > > > > > > > > can be large. Perhaps it's useful to expire an error metric
> > if
> > > > it's
> > > > > > no
> > > > > > > > > longer updated after some time? We did something 

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-22 Thread Rajini Sivaram
Hi all,

We can discuss the expiry implementation as part of the PR review. If there
are no further comments, I will start vote on this KIP tomorrow. Please let
me know if there are any other concerns.

On Fri, Aug 18, 2017 at 4:55 AM, Rajini Sivaram 
wrote:

> Perhaps we could register dynamically for now. If we find that the cost of
> retaining these is high, we can add the code to expire them later. Is that
> ok?
>
> Regards,
>
> Rajini
>
>
>
> On Fri, Aug 18, 2017 at 9:41 AM, Ismael Juma  wrote:
>
>> Can we quantify the cost of having these metrics around if they are
>> dynamically registered? Given that the maximum is bounded at development
>> time, is it really worth all the extra code?
>>
>> Ismael
>>
>> On Fri, Aug 18, 2017 at 9:34 AM, Rajini Sivaram 
>> wrote:
>>
>> > Jun,
>> >
>> > It feels more consistent to add errors as yammer metrics similar to
>> other
>> > request metrics. Perhaps we could add some code to track and remove
>> these
>> > if unused? It is a bit more work, but it would keep the externals
>> > consistent.
>> >
>> > Ismael/Manikumar,
>> >
>> > Agree that version as a String attribute makes more sense.
>> Unfortunately,
>> > the whole KafkaMetric implementation is written around a single "double"
>> > type, so introducing a new type is a big change. But I suppose it can be
>> > done. I have updated the KIP.
>> >
>> > Regards,
>> >
>> > Rajini
>> >
>> >
>> > On Fri, Aug 18, 2017 at 7:42 AM, Manikumar 
>> > wrote:
>> >
>> > > I agree it will be good if we can add  "commit id/version" as an
>> > > attribute value.
>> > > It will be easy to parse. But as of now, KafkaMetric supports only
>> > > numerical values.
>> > >
>> > > On Fri, Aug 18, 2017 at 5:49 AM, Ismael Juma 
>> wrote:
>> > >
>> > > > Hi Rajini,
>> > > >
>> > > > About the gauges, I was thinking that the attribute would be the
>> value
>> > > > (i.e. commit id or version). I understand that Kafka Metrics doesn't
>> > > > support this (unlike Yammer Metrics), but would it make sense to
>> add?
>> > > >
>> > > > Ismael
>> > > >
>> > > > On Thu, Aug 17, 2017 at 2:54 PM, Rajini Sivaram <
>> > rajinisiva...@gmail.com
>> > > >
>> > > > wrote:
>> > > >
>> > > > > Hi Ismael,
>> > > > >
>> > > > > Thank you for the review.
>> > > > >
>> > > > > 1. Agree on keeping it simple with dynamic registration and no
>> > expiry.
>> > > > Will
>> > > > > wait for Jun's feedback before updating KIP.
>> > > > > 2. I have switched to two metrics for commit-id and version (not
>> sure
>> > > if
>> > > > it
>> > > > > matches what you meant). I also added the client-id tag which is
>> used
>> > > in
>> > > > > all metrics from clients.
>> > > > >
>> > > > > Regards,
>> > > > >
>> > > > > Rajini
>> > > > >
>> > > > >
>> > > > > On Thu, Aug 17, 2017 at 10:55 AM, Ismael Juma 
>> > > wrote:
>> > > > >
>> > > > > > Thanks for the KIP, Rajini. I think this is helpful too. A few
>> > minor
>> > > > > > comments.
>> > > > > >
>> > > > > > 1. About the number of metrics and expiration, if we dynamically
>> > > > register
>> > > > > > metrics for the error codes, the number is likely to be much
>> lower
>> > > than
>> > > > > > 30*30, probably less than 100. If we were using Kafka Metrics
>> for
>> > > this,
>> > > > > we
>> > > > > > could easily add a long expiration period to be conservative,
>> but I
>> > > am
>> > > > > not
>> > > > > > sure this is supported by Yammer Metrics. If it is not, there's
>> an
>> > > > > argument
>> > > > > > for keeping it simple.
>> > > > > >
>> > > > > > 2. Would it make sense to use 2 gauges for the version and
>> commit
>> > id?
>> > > > It
>> > > > > > seems more intuitive than having those values as tags.
>> > > > > >
>> > > > > > Ismael
>> > > > > >
>> > > > > > On Thu, Aug 17, 2017 at 10:19 AM, Rajini Sivaram <
>> > > > > rajinisiva...@gmail.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hi Jun,
>> > > > > > >
>> > > > > > > Thank you for the review.
>> > > > > > >
>> > > > > > > 1. Makes sense. I have updated the KIP.
>> > > > > > > 2. Moved to a new group ZooKeeperClient
>> > > > > > > 3. It is a gauge, so it will have a single attribute called
>> Value
>> > > > with
>> > > > > a
>> > > > > > > constant value of 1.
>> > > > > > >
>> > > > > > > Regards,
>> > > > > > >
>> > > > > > > Rajini
>> > > > > > >
>> > > > > > > On Thu, Aug 17, 2017 at 3:16 AM, Jun Rao 
>> > wrote:
>> > > > > > >
>> > > > > > > > Hi, Rajini,
>> > > > > > > >
>> > > > > > > > Thanks for the KIP. A few comments.
>> > > > > > > >
>> > > > > > > > 1. We have 30+ requests and 30+ error code and growing. So,
>> the
>> > > > > > > combination
>> > > > > > > > can be large. Perhaps it's useful to expire an error metric
>> if
>> > > it's
>> > > > > no
>> > > > > > > > longer updated after some time? We did something similar for
>> > the
>> > > > > quota
>> > > > > > > > metric.
>> > > > > > 

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-18 Thread Rajini Sivaram
Perhaps we could register dynamically for now. If we find that the cost of
retaining these is high, we can add the code to expire them later. Is that
ok?

Regards,

Rajini



On Fri, Aug 18, 2017 at 9:41 AM, Ismael Juma  wrote:

> Can we quantify the cost of having these metrics around if they are
> dynamically registered? Given that the maximum is bounded at development
> time, is it really worth all the extra code?
>
> Ismael
>
> On Fri, Aug 18, 2017 at 9:34 AM, Rajini Sivaram 
> wrote:
>
> > Jun,
> >
> > It feels more consistent to add errors as yammer metrics similar to other
> > request metrics. Perhaps we could add some code to track and remove these
> > if unused? It is a bit more work, but it would keep the externals
> > consistent.
> >
> > Ismael/Manikumar,
> >
> > Agree that version as a String attribute makes more sense. Unfortunately,
> > the whole KafkaMetric implementation is written around a single "double"
> > type, so introducing a new type is a big change. But I suppose it can be
> > done. I have updated the KIP.
> >
> > Regards,
> >
> > Rajini
> >
> >
> > On Fri, Aug 18, 2017 at 7:42 AM, Manikumar 
> > wrote:
> >
> > > I agree it will be good if we can add  "commit id/version" as an
> > > attribute value.
> > > It will be easy to parse. But as of now, KafkaMetric supports only
> > > numerical values.
> > >
> > > On Fri, Aug 18, 2017 at 5:49 AM, Ismael Juma 
> wrote:
> > >
> > > > Hi Rajini,
> > > >
> > > > About the gauges, I was thinking that the attribute would be the
> value
> > > > (i.e. commit id or version). I understand that Kafka Metrics doesn't
> > > > support this (unlike Yammer Metrics), but would it make sense to add?
> > > >
> > > > Ismael
> > > >
> > > > On Thu, Aug 17, 2017 at 2:54 PM, Rajini Sivaram <
> > rajinisiva...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi Ismael,
> > > > >
> > > > > Thank you for the review.
> > > > >
> > > > > 1. Agree on keeping it simple with dynamic registration and no
> > expiry.
> > > > Will
> > > > > wait for Jun's feedback before updating KIP.
> > > > > 2. I have switched to two metrics for commit-id and version (not
> sure
> > > if
> > > > it
> > > > > matches what you meant). I also added the client-id tag which is
> used
> > > in
> > > > > all metrics from clients.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Rajini
> > > > >
> > > > >
> > > > > On Thu, Aug 17, 2017 at 10:55 AM, Ismael Juma 
> > > wrote:
> > > > >
> > > > > > Thanks for the KIP, Rajini. I think this is helpful too. A few
> > minor
> > > > > > comments.
> > > > > >
> > > > > > 1. About the number of metrics and expiration, if we dynamically
> > > > register
> > > > > > metrics for the error codes, the number is likely to be much
> lower
> > > than
> > > > > > 30*30, probably less than 100. If we were using Kafka Metrics for
> > > this,
> > > > > we
> > > > > > could easily add a long expiration period to be conservative,
> but I
> > > am
> > > > > not
> > > > > > sure this is supported by Yammer Metrics. If it is not, there's
> an
> > > > > argument
> > > > > > for keeping it simple.
> > > > > >
> > > > > > 2. Would it make sense to use 2 gauges for the version and commit
> > id?
> > > > It
> > > > > > seems more intuitive than having those values as tags.
> > > > > >
> > > > > > Ismael
> > > > > >
> > > > > > On Thu, Aug 17, 2017 at 10:19 AM, Rajini Sivaram <
> > > > > rajinisiva...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Jun,
> > > > > > >
> > > > > > > Thank you for the review.
> > > > > > >
> > > > > > > 1. Makes sense. I have updated the KIP.
> > > > > > > 2. Moved to a new group ZooKeeperClient
> > > > > > > 3. It is a gauge, so it will have a single attribute called
> Value
> > > > with
> > > > > a
> > > > > > > constant value of 1.
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Rajini
> > > > > > >
> > > > > > > On Thu, Aug 17, 2017 at 3:16 AM, Jun Rao 
> > wrote:
> > > > > > >
> > > > > > > > Hi, Rajini,
> > > > > > > >
> > > > > > > > Thanks for the KIP. A few comments.
> > > > > > > >
> > > > > > > > 1. We have 30+ requests and 30+ error code and growing. So,
> the
> > > > > > > combination
> > > > > > > > can be large. Perhaps it's useful to expire an error metric
> if
> > > it's
> > > > > no
> > > > > > > > longer updated after some time? We did something similar for
> > the
> > > > > quota
> > > > > > > > metric.
> > > > > > > >
> > > > > > > > 2. It's a bit weird to put the ZK latency metric under
> > > > > > > > type=SessionExpireListener.
> > > > > > > > Perhaps it's more intuitive to put it in a separate type.
> > > > > > > >
> > > > > > > > 3. For the client version metric, since we representing
> > commit_id
> > > > and
> > > > > > > > version as tags in the metric name. So the mbean will have no
> > > > > > attributes?
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > 

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-18 Thread Ismael Juma
Can we quantify the cost of having these metrics around if they are
dynamically registered? Given that the maximum is bounded at development
time, is it really worth all the extra code?

Ismael

On Fri, Aug 18, 2017 at 9:34 AM, Rajini Sivaram 
wrote:

> Jun,
>
> It feels more consistent to add errors as yammer metrics similar to other
> request metrics. Perhaps we could add some code to track and remove these
> if unused? It is a bit more work, but it would keep the externals
> consistent.
>
> Ismael/Manikumar,
>
> Agree that version as a String attribute makes more sense. Unfortunately,
> the whole KafkaMetric implementation is written around a single "double"
> type, so introducing a new type is a big change. But I suppose it can be
> done. I have updated the KIP.
>
> Regards,
>
> Rajini
>
>
> On Fri, Aug 18, 2017 at 7:42 AM, Manikumar 
> wrote:
>
> > I agree it will be good if we can add  "commit id/version" as an
> > attribute value.
> > It will be easy to parse. But as of now, KafkaMetric supports only
> > numerical values.
> >
> > On Fri, Aug 18, 2017 at 5:49 AM, Ismael Juma  wrote:
> >
> > > Hi Rajini,
> > >
> > > About the gauges, I was thinking that the attribute would be the value
> > > (i.e. commit id or version). I understand that Kafka Metrics doesn't
> > > support this (unlike Yammer Metrics), but would it make sense to add?
> > >
> > > Ismael
> > >
> > > On Thu, Aug 17, 2017 at 2:54 PM, Rajini Sivaram <
> rajinisiva...@gmail.com
> > >
> > > wrote:
> > >
> > > > Hi Ismael,
> > > >
> > > > Thank you for the review.
> > > >
> > > > 1. Agree on keeping it simple with dynamic registration and no
> expiry.
> > > Will
> > > > wait for Jun's feedback before updating KIP.
> > > > 2. I have switched to two metrics for commit-id and version (not sure
> > if
> > > it
> > > > matches what you meant). I also added the client-id tag which is used
> > in
> > > > all metrics from clients.
> > > >
> > > > Regards,
> > > >
> > > > Rajini
> > > >
> > > >
> > > > On Thu, Aug 17, 2017 at 10:55 AM, Ismael Juma 
> > wrote:
> > > >
> > > > > Thanks for the KIP, Rajini. I think this is helpful too. A few
> minor
> > > > > comments.
> > > > >
> > > > > 1. About the number of metrics and expiration, if we dynamically
> > > register
> > > > > metrics for the error codes, the number is likely to be much lower
> > than
> > > > > 30*30, probably less than 100. If we were using Kafka Metrics for
> > this,
> > > > we
> > > > > could easily add a long expiration period to be conservative, but I
> > am
> > > > not
> > > > > sure this is supported by Yammer Metrics. If it is not, there's an
> > > > argument
> > > > > for keeping it simple.
> > > > >
> > > > > 2. Would it make sense to use 2 gauges for the version and commit
> id?
> > > It
> > > > > seems more intuitive than having those values as tags.
> > > > >
> > > > > Ismael
> > > > >
> > > > > On Thu, Aug 17, 2017 at 10:19 AM, Rajini Sivaram <
> > > > rajinisiva...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > Thank you for the review.
> > > > > >
> > > > > > 1. Makes sense. I have updated the KIP.
> > > > > > 2. Moved to a new group ZooKeeperClient
> > > > > > 3. It is a gauge, so it will have a single attribute called Value
> > > with
> > > > a
> > > > > > constant value of 1.
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Rajini
> > > > > >
> > > > > > On Thu, Aug 17, 2017 at 3:16 AM, Jun Rao 
> wrote:
> > > > > >
> > > > > > > Hi, Rajini,
> > > > > > >
> > > > > > > Thanks for the KIP. A few comments.
> > > > > > >
> > > > > > > 1. We have 30+ requests and 30+ error code and growing. So, the
> > > > > > combination
> > > > > > > can be large. Perhaps it's useful to expire an error metric if
> > it's
> > > > no
> > > > > > > longer updated after some time? We did something similar for
> the
> > > > quota
> > > > > > > metric.
> > > > > > >
> > > > > > > 2. It's a bit weird to put the ZK latency metric under
> > > > > > > type=SessionExpireListener.
> > > > > > > Perhaps it's more intuitive to put it in a separate type.
> > > > > > >
> > > > > > > 3. For the client version metric, since we representing
> commit_id
> > > and
> > > > > > > version as tags in the metric name. So the mbean will have no
> > > > > attributes?
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Aug 16, 2017 at 4:05 PM, Roger Hoover <
> > > > roger.hoo...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I think it would useful to make clear somewhere for each
> > metric,
> > > > the
> > > > > > > level
> > > > > > > > at which it's counted.  I don't know all the details of the
> > Kafka
> > > > > > > protocol
> > > > > > > > but it might be something like
> > > > > > > >
> > > > > > > > ProduceRequest, Fetch Request - counted at per-partition
> level
> > > > > > > > All other requests 

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-18 Thread Rajini Sivaram
Jun,

It feels more consistent to add errors as yammer metrics similar to other
request metrics. Perhaps we could add some code to track and remove these
if unused? It is a bit more work, but it would keep the externals
consistent.

Ismael/Manikumar,

Agree that version as a String attribute makes more sense. Unfortunately,
the whole KafkaMetric implementation is written around a single "double"
type, so introducing a new type is a big change. But I suppose it can be
done. I have updated the KIP.

Regards,

Rajini


On Fri, Aug 18, 2017 at 7:42 AM, Manikumar 
wrote:

> I agree it will be good if we can add  "commit id/version" as an
> attribute value.
> It will be easy to parse. But as of now, KafkaMetric supports only
> numerical values.
>
> On Fri, Aug 18, 2017 at 5:49 AM, Ismael Juma  wrote:
>
> > Hi Rajini,
> >
> > About the gauges, I was thinking that the attribute would be the value
> > (i.e. commit id or version). I understand that Kafka Metrics doesn't
> > support this (unlike Yammer Metrics), but would it make sense to add?
> >
> > Ismael
> >
> > On Thu, Aug 17, 2017 at 2:54 PM, Rajini Sivaram  >
> > wrote:
> >
> > > Hi Ismael,
> > >
> > > Thank you for the review.
> > >
> > > 1. Agree on keeping it simple with dynamic registration and no expiry.
> > Will
> > > wait for Jun's feedback before updating KIP.
> > > 2. I have switched to two metrics for commit-id and version (not sure
> if
> > it
> > > matches what you meant). I also added the client-id tag which is used
> in
> > > all metrics from clients.
> > >
> > > Regards,
> > >
> > > Rajini
> > >
> > >
> > > On Thu, Aug 17, 2017 at 10:55 AM, Ismael Juma 
> wrote:
> > >
> > > > Thanks for the KIP, Rajini. I think this is helpful too. A few minor
> > > > comments.
> > > >
> > > > 1. About the number of metrics and expiration, if we dynamically
> > register
> > > > metrics for the error codes, the number is likely to be much lower
> than
> > > > 30*30, probably less than 100. If we were using Kafka Metrics for
> this,
> > > we
> > > > could easily add a long expiration period to be conservative, but I
> am
> > > not
> > > > sure this is supported by Yammer Metrics. If it is not, there's an
> > > argument
> > > > for keeping it simple.
> > > >
> > > > 2. Would it make sense to use 2 gauges for the version and commit id?
> > It
> > > > seems more intuitive than having those values as tags.
> > > >
> > > > Ismael
> > > >
> > > > On Thu, Aug 17, 2017 at 10:19 AM, Rajini Sivaram <
> > > rajinisiva...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > Thank you for the review.
> > > > >
> > > > > 1. Makes sense. I have updated the KIP.
> > > > > 2. Moved to a new group ZooKeeperClient
> > > > > 3. It is a gauge, so it will have a single attribute called Value
> > with
> > > a
> > > > > constant value of 1.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Rajini
> > > > >
> > > > > On Thu, Aug 17, 2017 at 3:16 AM, Jun Rao  wrote:
> > > > >
> > > > > > Hi, Rajini,
> > > > > >
> > > > > > Thanks for the KIP. A few comments.
> > > > > >
> > > > > > 1. We have 30+ requests and 30+ error code and growing. So, the
> > > > > combination
> > > > > > can be large. Perhaps it's useful to expire an error metric if
> it's
> > > no
> > > > > > longer updated after some time? We did something similar for the
> > > quota
> > > > > > metric.
> > > > > >
> > > > > > 2. It's a bit weird to put the ZK latency metric under
> > > > > > type=SessionExpireListener.
> > > > > > Perhaps it's more intuitive to put it in a separate type.
> > > > > >
> > > > > > 3. For the client version metric, since we representing commit_id
> > and
> > > > > > version as tags in the metric name. So the mbean will have no
> > > > attributes?
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Aug 16, 2017 at 4:05 PM, Roger Hoover <
> > > roger.hoo...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > I think it would useful to make clear somewhere for each
> metric,
> > > the
> > > > > > level
> > > > > > > at which it's counted.  I don't know all the details of the
> Kafka
> > > > > > protocol
> > > > > > > but it might be something like
> > > > > > >
> > > > > > > ProduceRequest, Fetch Request - counted at per-partition level
> > > > > > > All other requests are 1:1 with client requests?
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Roger
> > > > > > >
> > > > > > > On Wed, Aug 16, 2017 at 4:02 PM, Roger Hoover <
> > > > roger.hoo...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Rajini,
> > > > > > > >
> > > > > > > > Thank you for the KIP.  These are very helpful additions.
> One
> > > > > question
> > > > > > > on
> > > > > > > > the error code metrics:
> > > > > > > >
> > > > > > > > Will the total error counting happen at the the level of
> topic
> > > > > > partition?
> > > > > > > > For 

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-18 Thread Manikumar
I agree it will be good if we can add  "commit id/version" as an
attribute value.
It will be easy to parse. But as of now, KafkaMetric supports only
numerical values.

On Fri, Aug 18, 2017 at 5:49 AM, Ismael Juma  wrote:

> Hi Rajini,
>
> About the gauges, I was thinking that the attribute would be the value
> (i.e. commit id or version). I understand that Kafka Metrics doesn't
> support this (unlike Yammer Metrics), but would it make sense to add?
>
> Ismael
>
> On Thu, Aug 17, 2017 at 2:54 PM, Rajini Sivaram 
> wrote:
>
> > Hi Ismael,
> >
> > Thank you for the review.
> >
> > 1. Agree on keeping it simple with dynamic registration and no expiry.
> Will
> > wait for Jun's feedback before updating KIP.
> > 2. I have switched to two metrics for commit-id and version (not sure if
> it
> > matches what you meant). I also added the client-id tag which is used in
> > all metrics from clients.
> >
> > Regards,
> >
> > Rajini
> >
> >
> > On Thu, Aug 17, 2017 at 10:55 AM, Ismael Juma  wrote:
> >
> > > Thanks for the KIP, Rajini. I think this is helpful too. A few minor
> > > comments.
> > >
> > > 1. About the number of metrics and expiration, if we dynamically
> register
> > > metrics for the error codes, the number is likely to be much lower than
> > > 30*30, probably less than 100. If we were using Kafka Metrics for this,
> > we
> > > could easily add a long expiration period to be conservative, but I am
> > not
> > > sure this is supported by Yammer Metrics. If it is not, there's an
> > argument
> > > for keeping it simple.
> > >
> > > 2. Would it make sense to use 2 gauges for the version and commit id?
> It
> > > seems more intuitive than having those values as tags.
> > >
> > > Ismael
> > >
> > > On Thu, Aug 17, 2017 at 10:19 AM, Rajini Sivaram <
> > rajinisiva...@gmail.com>
> > > wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > Thank you for the review.
> > > >
> > > > 1. Makes sense. I have updated the KIP.
> > > > 2. Moved to a new group ZooKeeperClient
> > > > 3. It is a gauge, so it will have a single attribute called Value
> with
> > a
> > > > constant value of 1.
> > > >
> > > > Regards,
> > > >
> > > > Rajini
> > > >
> > > > On Thu, Aug 17, 2017 at 3:16 AM, Jun Rao  wrote:
> > > >
> > > > > Hi, Rajini,
> > > > >
> > > > > Thanks for the KIP. A few comments.
> > > > >
> > > > > 1. We have 30+ requests and 30+ error code and growing. So, the
> > > > combination
> > > > > can be large. Perhaps it's useful to expire an error metric if it's
> > no
> > > > > longer updated after some time? We did something similar for the
> > quota
> > > > > metric.
> > > > >
> > > > > 2. It's a bit weird to put the ZK latency metric under
> > > > > type=SessionExpireListener.
> > > > > Perhaps it's more intuitive to put it in a separate type.
> > > > >
> > > > > 3. For the client version metric, since we representing commit_id
> and
> > > > > version as tags in the metric name. So the mbean will have no
> > > attributes?
> > > > >
> > > > > Jun
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Aug 16, 2017 at 4:05 PM, Roger Hoover <
> > roger.hoo...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > I think it would useful to make clear somewhere for each metric,
> > the
> > > > > level
> > > > > > at which it's counted.  I don't know all the details of the Kafka
> > > > > protocol
> > > > > > but it might be something like
> > > > > >
> > > > > > ProduceRequest, Fetch Request - counted at per-partition level
> > > > > > All other requests are 1:1 with client requests?
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Roger
> > > > > >
> > > > > > On Wed, Aug 16, 2017 at 4:02 PM, Roger Hoover <
> > > roger.hoo...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Rajini,
> > > > > > >
> > > > > > > Thank you for the KIP.  These are very helpful additions.  One
> > > > question
> > > > > > on
> > > > > > > the error code metrics:
> > > > > > >
> > > > > > > Will the total error counting happen at the the level of topic
> > > > > partition?
> > > > > > > For example, if a single ProduceRequest contains messages to
> > append
> > > > to
> > > > > 3
> > > > > > > partitions and say all 3 appends are successful, the counter
> > > > > > > for kafka.network:type=RequestMetrics,name=
> ErrorsPerSec,request=
> > > > > > ProduceRequest,error=0
> > > > > > > will be incremented by 3?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Roger
> > > > > > >
> > > > > > > On Wed, Aug 16, 2017 at 12:10 PM, Rajini Sivaram <
> > > > > > rajinisiva...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> I have created a KIP to add some additional metrics to support
> > > > health
> > > > > > >> checks:
> > > > > > >>
> > > > > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-188+-+
> > > > > > >> Add+new+metrics+to+support+health+checks
> > > > > > >>
> > > > > > >> Feedback and suggestions are welcome.
> > > > > > >>
> > > > 

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-17 Thread Ismael Juma
Hi Rajini,

About the gauges, I was thinking that the attribute would be the value
(i.e. commit id or version). I understand that Kafka Metrics doesn't
support this (unlike Yammer Metrics), but would it make sense to add?

Ismael

On Thu, Aug 17, 2017 at 2:54 PM, Rajini Sivaram 
wrote:

> Hi Ismael,
>
> Thank you for the review.
>
> 1. Agree on keeping it simple with dynamic registration and no expiry. Will
> wait for Jun's feedback before updating KIP.
> 2. I have switched to two metrics for commit-id and version (not sure if it
> matches what you meant). I also added the client-id tag which is used in
> all metrics from clients.
>
> Regards,
>
> Rajini
>
>
> On Thu, Aug 17, 2017 at 10:55 AM, Ismael Juma  wrote:
>
> > Thanks for the KIP, Rajini. I think this is helpful too. A few minor
> > comments.
> >
> > 1. About the number of metrics and expiration, if we dynamically register
> > metrics for the error codes, the number is likely to be much lower than
> > 30*30, probably less than 100. If we were using Kafka Metrics for this,
> we
> > could easily add a long expiration period to be conservative, but I am
> not
> > sure this is supported by Yammer Metrics. If it is not, there's an
> argument
> > for keeping it simple.
> >
> > 2. Would it make sense to use 2 gauges for the version and commit id? It
> > seems more intuitive than having those values as tags.
> >
> > Ismael
> >
> > On Thu, Aug 17, 2017 at 10:19 AM, Rajini Sivaram <
> rajinisiva...@gmail.com>
> > wrote:
> >
> > > Hi Jun,
> > >
> > > Thank you for the review.
> > >
> > > 1. Makes sense. I have updated the KIP.
> > > 2. Moved to a new group ZooKeeperClient
> > > 3. It is a gauge, so it will have a single attribute called Value with
> a
> > > constant value of 1.
> > >
> > > Regards,
> > >
> > > Rajini
> > >
> > > On Thu, Aug 17, 2017 at 3:16 AM, Jun Rao  wrote:
> > >
> > > > Hi, Rajini,
> > > >
> > > > Thanks for the KIP. A few comments.
> > > >
> > > > 1. We have 30+ requests and 30+ error code and growing. So, the
> > > combination
> > > > can be large. Perhaps it's useful to expire an error metric if it's
> no
> > > > longer updated after some time? We did something similar for the
> quota
> > > > metric.
> > > >
> > > > 2. It's a bit weird to put the ZK latency metric under
> > > > type=SessionExpireListener.
> > > > Perhaps it's more intuitive to put it in a separate type.
> > > >
> > > > 3. For the client version metric, since we representing commit_id and
> > > > version as tags in the metric name. So the mbean will have no
> > attributes?
> > > >
> > > > Jun
> > > >
> > > >
> > > >
> > > > On Wed, Aug 16, 2017 at 4:05 PM, Roger Hoover <
> roger.hoo...@gmail.com>
> > > > wrote:
> > > >
> > > > > I think it would useful to make clear somewhere for each metric,
> the
> > > > level
> > > > > at which it's counted.  I don't know all the details of the Kafka
> > > > protocol
> > > > > but it might be something like
> > > > >
> > > > > ProduceRequest, Fetch Request - counted at per-partition level
> > > > > All other requests are 1:1 with client requests?
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Roger
> > > > >
> > > > > On Wed, Aug 16, 2017 at 4:02 PM, Roger Hoover <
> > roger.hoo...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Rajini,
> > > > > >
> > > > > > Thank you for the KIP.  These are very helpful additions.  One
> > > question
> > > > > on
> > > > > > the error code metrics:
> > > > > >
> > > > > > Will the total error counting happen at the the level of topic
> > > > partition?
> > > > > > For example, if a single ProduceRequest contains messages to
> append
> > > to
> > > > 3
> > > > > > partitions and say all 3 appends are successful, the counter
> > > > > > for kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=
> > > > > ProduceRequest,error=0
> > > > > > will be incremented by 3?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Roger
> > > > > >
> > > > > > On Wed, Aug 16, 2017 at 12:10 PM, Rajini Sivaram <
> > > > > rajinisiva...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > >> I have created a KIP to add some additional metrics to support
> > > health
> > > > > >> checks:
> > > > > >>
> > > > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-188+-+
> > > > > >> Add+new+metrics+to+support+health+checks
> > > > > >>
> > > > > >> Feedback and suggestions are welcome.
> > > > > >>
> > > > > >> Regards,
> > > > > >>
> > > > > >> Rajini
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-17 Thread Jun Rao
Hi, Rajini,

1. Yammer doesn't seem to support expiration. I guess we can use the Kafka
metric in this case.

Thanks,

Jun

On Thu, Aug 17, 2017 at 4:53 PM, Roger Hoover 
wrote:

> Rajini,
>
> The table is super helpful.  Thank you.
>
> On Thu, Aug 17, 2017 at 2:16 AM, Rajini Sivaram 
> wrote:
>
> > Hi Roger,
> >
> > Thank you for the review. I have added a table with the scope of errors
> > counted for each request.
> >
> > Regards,
> >
> > Rajini
> >
> > On Thu, Aug 17, 2017 at 12:05 AM, Roger Hoover 
> > wrote:
> >
> > > I think it would useful to make clear somewhere for each metric, the
> > level
> > > at which it's counted.  I don't know all the details of the Kafka
> > protocol
> > > but it might be something like
> > >
> > > ProduceRequest, Fetch Request - counted at per-partition level
> > > All other requests are 1:1 with client requests?
> > >
> > > Cheers,
> > >
> > > Roger
> > >
> > > On Wed, Aug 16, 2017 at 4:02 PM, Roger Hoover 
> > > wrote:
> > >
> > > > Rajini,
> > > >
> > > > Thank you for the KIP.  These are very helpful additions.  One
> question
> > > on
> > > > the error code metrics:
> > > >
> > > > Will the total error counting happen at the the level of topic
> > partition?
> > > > For example, if a single ProduceRequest contains messages to append
> to
> > 3
> > > > partitions and say all 3 appends are successful, the counter
> > > > for kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=
> > > ProduceRequest,error=0
> > > > will be incremented by 3?
> > > >
> > > > Thanks,
> > > >
> > > > Roger
> > > >
> > > > On Wed, Aug 16, 2017 at 12:10 PM, Rajini Sivaram <
> > > rajinisiva...@gmail.com>
> > > > wrote:
> > > >
> > > >> I have created a KIP to add some additional metrics to support
> health
> > > >> checks:
> > > >>
> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-188+-+
> > > >> Add+new+metrics+to+support+health+checks
> > > >>
> > > >> Feedback and suggestions are welcome.
> > > >>
> > > >> Regards,
> > > >>
> > > >> Rajini
> > > >>
> > > >
> > > >
> > >
> >
>


Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-17 Thread Roger Hoover
Rajini,

The table is super helpful.  Thank you.

On Thu, Aug 17, 2017 at 2:16 AM, Rajini Sivaram 
wrote:

> Hi Roger,
>
> Thank you for the review. I have added a table with the scope of errors
> counted for each request.
>
> Regards,
>
> Rajini
>
> On Thu, Aug 17, 2017 at 12:05 AM, Roger Hoover 
> wrote:
>
> > I think it would useful to make clear somewhere for each metric, the
> level
> > at which it's counted.  I don't know all the details of the Kafka
> protocol
> > but it might be something like
> >
> > ProduceRequest, Fetch Request - counted at per-partition level
> > All other requests are 1:1 with client requests?
> >
> > Cheers,
> >
> > Roger
> >
> > On Wed, Aug 16, 2017 at 4:02 PM, Roger Hoover 
> > wrote:
> >
> > > Rajini,
> > >
> > > Thank you for the KIP.  These are very helpful additions.  One question
> > on
> > > the error code metrics:
> > >
> > > Will the total error counting happen at the the level of topic
> partition?
> > > For example, if a single ProduceRequest contains messages to append to
> 3
> > > partitions and say all 3 appends are successful, the counter
> > > for kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=
> > ProduceRequest,error=0
> > > will be incremented by 3?
> > >
> > > Thanks,
> > >
> > > Roger
> > >
> > > On Wed, Aug 16, 2017 at 12:10 PM, Rajini Sivaram <
> > rajinisiva...@gmail.com>
> > > wrote:
> > >
> > >> I have created a KIP to add some additional metrics to support health
> > >> checks:
> > >>
> > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-188+-+
> > >> Add+new+metrics+to+support+health+checks
> > >>
> > >> Feedback and suggestions are welcome.
> > >>
> > >> Regards,
> > >>
> > >> Rajini
> > >>
> > >
> > >
> >
>


Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-17 Thread Rajini Sivaram
Hi Ismael,

Thank you for the review.

1. Agree on keeping it simple with dynamic registration and no expiry. Will
wait for Jun's feedback before updating KIP.
2. I have switched to two metrics for commit-id and version (not sure if it
matches what you meant). I also added the client-id tag which is used in
all metrics from clients.

Regards,

Rajini


On Thu, Aug 17, 2017 at 10:55 AM, Ismael Juma  wrote:

> Thanks for the KIP, Rajini. I think this is helpful too. A few minor
> comments.
>
> 1. About the number of metrics and expiration, if we dynamically register
> metrics for the error codes, the number is likely to be much lower than
> 30*30, probably less than 100. If we were using Kafka Metrics for this, we
> could easily add a long expiration period to be conservative, but I am not
> sure this is supported by Yammer Metrics. If it is not, there's an argument
> for keeping it simple.
>
> 2. Would it make sense to use 2 gauges for the version and commit id? It
> seems more intuitive than having those values as tags.
>
> Ismael
>
> On Thu, Aug 17, 2017 at 10:19 AM, Rajini Sivaram 
> wrote:
>
> > Hi Jun,
> >
> > Thank you for the review.
> >
> > 1. Makes sense. I have updated the KIP.
> > 2. Moved to a new group ZooKeeperClient
> > 3. It is a gauge, so it will have a single attribute called Value with a
> > constant value of 1.
> >
> > Regards,
> >
> > Rajini
> >
> > On Thu, Aug 17, 2017 at 3:16 AM, Jun Rao  wrote:
> >
> > > Hi, Rajini,
> > >
> > > Thanks for the KIP. A few comments.
> > >
> > > 1. We have 30+ requests and 30+ error code and growing. So, the
> > combination
> > > can be large. Perhaps it's useful to expire an error metric if it's no
> > > longer updated after some time? We did something similar for the quota
> > > metric.
> > >
> > > 2. It's a bit weird to put the ZK latency metric under
> > > type=SessionExpireListener.
> > > Perhaps it's more intuitive to put it in a separate type.
> > >
> > > 3. For the client version metric, since we representing commit_id and
> > > version as tags in the metric name. So the mbean will have no
> attributes?
> > >
> > > Jun
> > >
> > >
> > >
> > > On Wed, Aug 16, 2017 at 4:05 PM, Roger Hoover 
> > > wrote:
> > >
> > > > I think it would useful to make clear somewhere for each metric, the
> > > level
> > > > at which it's counted.  I don't know all the details of the Kafka
> > > protocol
> > > > but it might be something like
> > > >
> > > > ProduceRequest, Fetch Request - counted at per-partition level
> > > > All other requests are 1:1 with client requests?
> > > >
> > > > Cheers,
> > > >
> > > > Roger
> > > >
> > > > On Wed, Aug 16, 2017 at 4:02 PM, Roger Hoover <
> roger.hoo...@gmail.com>
> > > > wrote:
> > > >
> > > > > Rajini,
> > > > >
> > > > > Thank you for the KIP.  These are very helpful additions.  One
> > question
> > > > on
> > > > > the error code metrics:
> > > > >
> > > > > Will the total error counting happen at the the level of topic
> > > partition?
> > > > > For example, if a single ProduceRequest contains messages to append
> > to
> > > 3
> > > > > partitions and say all 3 appends are successful, the counter
> > > > > for kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=
> > > > ProduceRequest,error=0
> > > > > will be incremented by 3?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Roger
> > > > >
> > > > > On Wed, Aug 16, 2017 at 12:10 PM, Rajini Sivaram <
> > > > rajinisiva...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> I have created a KIP to add some additional metrics to support
> > health
> > > > >> checks:
> > > > >>
> > > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-188+-+
> > > > >> Add+new+metrics+to+support+health+checks
> > > > >>
> > > > >> Feedback and suggestions are welcome.
> > > > >>
> > > > >> Regards,
> > > > >>
> > > > >> Rajini
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-17 Thread Ismael Juma
Thanks for the KIP, Rajini. I think this is helpful too. A few minor
comments.

1. About the number of metrics and expiration, if we dynamically register
metrics for the error codes, the number is likely to be much lower than
30*30, probably less than 100. If we were using Kafka Metrics for this, we
could easily add a long expiration period to be conservative, but I am not
sure this is supported by Yammer Metrics. If it is not, there's an argument
for keeping it simple.

2. Would it make sense to use 2 gauges for the version and commit id? It
seems more intuitive than having those values as tags.

Ismael

On Thu, Aug 17, 2017 at 10:19 AM, Rajini Sivaram 
wrote:

> Hi Jun,
>
> Thank you for the review.
>
> 1. Makes sense. I have updated the KIP.
> 2. Moved to a new group ZooKeeperClient
> 3. It is a gauge, so it will have a single attribute called Value with a
> constant value of 1.
>
> Regards,
>
> Rajini
>
> On Thu, Aug 17, 2017 at 3:16 AM, Jun Rao  wrote:
>
> > Hi, Rajini,
> >
> > Thanks for the KIP. A few comments.
> >
> > 1. We have 30+ requests and 30+ error code and growing. So, the
> combination
> > can be large. Perhaps it's useful to expire an error metric if it's no
> > longer updated after some time? We did something similar for the quota
> > metric.
> >
> > 2. It's a bit weird to put the ZK latency metric under
> > type=SessionExpireListener.
> > Perhaps it's more intuitive to put it in a separate type.
> >
> > 3. For the client version metric, since we representing commit_id and
> > version as tags in the metric name. So the mbean will have no attributes?
> >
> > Jun
> >
> >
> >
> > On Wed, Aug 16, 2017 at 4:05 PM, Roger Hoover 
> > wrote:
> >
> > > I think it would useful to make clear somewhere for each metric, the
> > level
> > > at which it's counted.  I don't know all the details of the Kafka
> > protocol
> > > but it might be something like
> > >
> > > ProduceRequest, Fetch Request - counted at per-partition level
> > > All other requests are 1:1 with client requests?
> > >
> > > Cheers,
> > >
> > > Roger
> > >
> > > On Wed, Aug 16, 2017 at 4:02 PM, Roger Hoover 
> > > wrote:
> > >
> > > > Rajini,
> > > >
> > > > Thank you for the KIP.  These are very helpful additions.  One
> question
> > > on
> > > > the error code metrics:
> > > >
> > > > Will the total error counting happen at the the level of topic
> > partition?
> > > > For example, if a single ProduceRequest contains messages to append
> to
> > 3
> > > > partitions and say all 3 appends are successful, the counter
> > > > for kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=
> > > ProduceRequest,error=0
> > > > will be incremented by 3?
> > > >
> > > > Thanks,
> > > >
> > > > Roger
> > > >
> > > > On Wed, Aug 16, 2017 at 12:10 PM, Rajini Sivaram <
> > > rajinisiva...@gmail.com>
> > > > wrote:
> > > >
> > > >> I have created a KIP to add some additional metrics to support
> health
> > > >> checks:
> > > >>
> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-188+-+
> > > >> Add+new+metrics+to+support+health+checks
> > > >>
> > > >> Feedback and suggestions are welcome.
> > > >>
> > > >> Regards,
> > > >>
> > > >> Rajini
> > > >>
> > > >
> > > >
> > >
> >
>


Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-17 Thread Rajini Sivaram
Hi Jun,

Thank you for the review.

1. Makes sense. I have updated the KIP.
2. Moved to a new group ZooKeeperClient
3. It is a gauge, so it will have a single attribute called Value with a
constant value of 1.

Regards,

Rajini

On Thu, Aug 17, 2017 at 3:16 AM, Jun Rao  wrote:

> Hi, Rajini,
>
> Thanks for the KIP. A few comments.
>
> 1. We have 30+ requests and 30+ error code and growing. So, the combination
> can be large. Perhaps it's useful to expire an error metric if it's no
> longer updated after some time? We did something similar for the quota
> metric.
>
> 2. It's a bit weird to put the ZK latency metric under
> type=SessionExpireListener.
> Perhaps it's more intuitive to put it in a separate type.
>
> 3. For the client version metric, since we representing commit_id and
> version as tags in the metric name. So the mbean will have no attributes?
>
> Jun
>
>
>
> On Wed, Aug 16, 2017 at 4:05 PM, Roger Hoover 
> wrote:
>
> > I think it would useful to make clear somewhere for each metric, the
> level
> > at which it's counted.  I don't know all the details of the Kafka
> protocol
> > but it might be something like
> >
> > ProduceRequest, Fetch Request - counted at per-partition level
> > All other requests are 1:1 with client requests?
> >
> > Cheers,
> >
> > Roger
> >
> > On Wed, Aug 16, 2017 at 4:02 PM, Roger Hoover 
> > wrote:
> >
> > > Rajini,
> > >
> > > Thank you for the KIP.  These are very helpful additions.  One question
> > on
> > > the error code metrics:
> > >
> > > Will the total error counting happen at the the level of topic
> partition?
> > > For example, if a single ProduceRequest contains messages to append to
> 3
> > > partitions and say all 3 appends are successful, the counter
> > > for kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=
> > ProduceRequest,error=0
> > > will be incremented by 3?
> > >
> > > Thanks,
> > >
> > > Roger
> > >
> > > On Wed, Aug 16, 2017 at 12:10 PM, Rajini Sivaram <
> > rajinisiva...@gmail.com>
> > > wrote:
> > >
> > >> I have created a KIP to add some additional metrics to support health
> > >> checks:
> > >>
> > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-188+-+
> > >> Add+new+metrics+to+support+health+checks
> > >>
> > >> Feedback and suggestions are welcome.
> > >>
> > >> Regards,
> > >>
> > >> Rajini
> > >>
> > >
> > >
> >
>


Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-17 Thread Rajini Sivaram
Hi Roger,

Thank you for the review. I have added a table with the scope of errors
counted for each request.

Regards,

Rajini

On Thu, Aug 17, 2017 at 12:05 AM, Roger Hoover 
wrote:

> I think it would useful to make clear somewhere for each metric, the level
> at which it's counted.  I don't know all the details of the Kafka protocol
> but it might be something like
>
> ProduceRequest, Fetch Request - counted at per-partition level
> All other requests are 1:1 with client requests?
>
> Cheers,
>
> Roger
>
> On Wed, Aug 16, 2017 at 4:02 PM, Roger Hoover 
> wrote:
>
> > Rajini,
> >
> > Thank you for the KIP.  These are very helpful additions.  One question
> on
> > the error code metrics:
> >
> > Will the total error counting happen at the the level of topic partition?
> > For example, if a single ProduceRequest contains messages to append to 3
> > partitions and say all 3 appends are successful, the counter
> > for kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=
> ProduceRequest,error=0
> > will be incremented by 3?
> >
> > Thanks,
> >
> > Roger
> >
> > On Wed, Aug 16, 2017 at 12:10 PM, Rajini Sivaram <
> rajinisiva...@gmail.com>
> > wrote:
> >
> >> I have created a KIP to add some additional metrics to support health
> >> checks:
> >>
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-188+-+
> >> Add+new+metrics+to+support+health+checks
> >>
> >> Feedback and suggestions are welcome.
> >>
> >> Regards,
> >>
> >> Rajini
> >>
> >
> >
>


Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-16 Thread Jun Rao
Hi, Rajini,

Thanks for the KIP. A few comments.

1. We have 30+ requests and 30+ error code and growing. So, the combination
can be large. Perhaps it's useful to expire an error metric if it's no
longer updated after some time? We did something similar for the quota
metric.

2. It's a bit weird to put the ZK latency metric under
type=SessionExpireListener.
Perhaps it's more intuitive to put it in a separate type.

3. For the client version metric, since we representing commit_id and
version as tags in the metric name. So the mbean will have no attributes?

Jun



On Wed, Aug 16, 2017 at 4:05 PM, Roger Hoover 
wrote:

> I think it would useful to make clear somewhere for each metric, the level
> at which it's counted.  I don't know all the details of the Kafka protocol
> but it might be something like
>
> ProduceRequest, Fetch Request - counted at per-partition level
> All other requests are 1:1 with client requests?
>
> Cheers,
>
> Roger
>
> On Wed, Aug 16, 2017 at 4:02 PM, Roger Hoover 
> wrote:
>
> > Rajini,
> >
> > Thank you for the KIP.  These are very helpful additions.  One question
> on
> > the error code metrics:
> >
> > Will the total error counting happen at the the level of topic partition?
> > For example, if a single ProduceRequest contains messages to append to 3
> > partitions and say all 3 appends are successful, the counter
> > for kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=
> ProduceRequest,error=0
> > will be incremented by 3?
> >
> > Thanks,
> >
> > Roger
> >
> > On Wed, Aug 16, 2017 at 12:10 PM, Rajini Sivaram <
> rajinisiva...@gmail.com>
> > wrote:
> >
> >> I have created a KIP to add some additional metrics to support health
> >> checks:
> >>
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-188+-+
> >> Add+new+metrics+to+support+health+checks
> >>
> >> Feedback and suggestions are welcome.
> >>
> >> Regards,
> >>
> >> Rajini
> >>
> >
> >
>


Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-16 Thread Roger Hoover
I think it would useful to make clear somewhere for each metric, the level
at which it's counted.  I don't know all the details of the Kafka protocol
but it might be something like

ProduceRequest, Fetch Request - counted at per-partition level
All other requests are 1:1 with client requests?

Cheers,

Roger

On Wed, Aug 16, 2017 at 4:02 PM, Roger Hoover 
wrote:

> Rajini,
>
> Thank you for the KIP.  These are very helpful additions.  One question on
> the error code metrics:
>
> Will the total error counting happen at the the level of topic partition?
> For example, if a single ProduceRequest contains messages to append to 3
> partitions and say all 3 appends are successful, the counter
> for 
> kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=ProduceRequest,error=0
> will be incremented by 3?
>
> Thanks,
>
> Roger
>
> On Wed, Aug 16, 2017 at 12:10 PM, Rajini Sivaram 
> wrote:
>
>> I have created a KIP to add some additional metrics to support health
>> checks:
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-188+-+
>> Add+new+metrics+to+support+health+checks
>>
>> Feedback and suggestions are welcome.
>>
>> Regards,
>>
>> Rajini
>>
>
>


Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-16 Thread Roger Hoover
Rajini,

Thank you for the KIP.  These are very helpful additions.  One question on
the error code metrics:

Will the total error counting happen at the the level of topic partition?
For example, if a single ProduceRequest contains messages to append to 3
partitions and say all 3 appends are successful, the counter
for 
kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=ProduceRequest,error=0
will be incremented by 3?

Thanks,

Roger

On Wed, Aug 16, 2017 at 12:10 PM, Rajini Sivaram 
wrote:

> I have created a KIP to add some additional metrics to support health
> checks:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 188+-+Add+new+metrics+to+support+health+checks
>
> Feedback and suggestions are welcome.
>
> Regards,
>
> Rajini
>


[DISCUSS] KIP-188 - Add new metrics to support health checks

2017-08-16 Thread Rajini Sivaram
I have created a KIP to add some additional metrics to support health
checks:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-188+-+Add+new+metrics+to+support+health+checks

Feedback and suggestions are welcome.

Regards,

Rajini