Re: [DISCUSS] KIP-714: Client metrics and observability

Doğuşcan Namal Tue, 08 Aug 2023 00:35:26 -0700

Thanks for your answers Andrew. I share your pain that it took a while for
you to get this KIP approved and you want to reduce the scope of it, will
be happy to help you with the implementation :)


Could you help me walk through what happens if the target broker is
unreachable? Is the client going to drop these metrics or is it going to
send it to the other brokers it is connected to? This information is
crucial to understand the client side impact on leadership failovers.
Moreover, in case of partial outages, such as only the network between the
client and the broker is partitioned whereas the network within the cluster
is healthy, practically there is no other way than the client side metrics
to identify this problem.

Doguscan

On Fri, 4 Aug 2023 at 15:33, Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi Doguscan,
> Thanks for your comments. I’m glad to hear you’re interested in this KIP.
>
> 1) It is preferred that a client sends its metrics to the same broker
> connection
> but actually it is able to send them to any broker. As a result, if a
> broker becomes
> unhealthy, the client can push its metrics to any other broker. It seems
> to me that
> pushing to KRaft controllers instead just has the effect of increasing the
> load on
> the controllers, while still having the characteristic that an unhealthy
> controller
> would present inconvenience for collecting metrics.
>
> 2) When the `PushTelemetryRequest.Terminating` flag is set, the standard
> request
> throttling is not disabled. The metrics rate-limiting based on the push
> interval is
> not applied in this case for a single request for the combination of
> client instance ID
> and subscription ID.
>
> (I have corrected the KIP text because it erroneously said “client ID and
> subscription ID”.
>
> 3) While this is a theoretical problem, I’m not keen on adding yet more
> configurations
> to the broker or client. The `interval.ms` configuration on the
> CLIENT_METRICS
> resource could perhaps have a minimum and maximum value to prevent
> accidental
> misconfiguration.
>
> 4) One of the reasons that this KIP has taken so long to get to this stage
> is that
> it tried to do many things all at once. So, it’s greatly simplified
> compared with
> 6 months ago. I can see the value of collecting client configurations for
> problem
> determination, but I don’t want to make this KIP more complicated. I think
> the
> idea has merit as a separate follow-on KIP. I would be happy to collaborate
> with you on this.
>
> 5) The default is set to 5 minutes to minimise the load on the broker for
> situations
> in which the administrator didn’t set an interval on a metrics
> subscription. To
> use an interval of 1 minute, it is only necessary to set `interval.ms` on
> the metrics
> subscription to 60000ms.
>
> 6) Uncompressed data is always supported. The KIP says:
>  "The CompressionType of NONE will not be
> "present in the response from the broker, though the broker does support
> uncompressed
> "client telemetry if none of the accepted compression codecs are supported
> by the client.”
> So in your example, the client need only use CompressionType=NONE.
>
> Thanks,
> Andrew
>
> > On 4 Aug 2023, at 14:04, Doğuşcan Namal <namal.dogus...@gmail.com>
> wrote:
> >
> > Hi Andrew, thanks a lot for this KIP. I was thinking of something similar
> > so thanks for writing this down 😊
> >
> >
> >
> > Couple of questions related to the design:
> >
> >
> >
> > 1. Can we investigate the option for using the Kraft controllers instead
> of
> > the brokers for sending metrics? The disadvantage of sending these
> metrics
> > directly to the brokers tightly couples metric observability to data
> plane
> > availability. If the broker is unhealthy then the root cause of an
> incident
> > is clear however on partial failures it makes it hard to debug these
> > incidents from the brokers perspective.
> >
> >
> >
> > 2. Ratelimiting will be disable if the `PushTelemetryRequest.Terminating`
> > flag is set. However, this may cause unavailability on the broker if too
> > many clients are terminated at once, especially network threads could
> > become busy and introduce latency on the produce/consume on other
> > non-terminating clients connections. I think there is a room for
> > improvement here. If the client is gracefully shutting down, it could
> wait
> > for the request to be handled if it is being ratelimited, it doesn't need
> > to "force push" the metrics. For that reason, maybe we could define a
> > separate ratelimiting for telemetry data?
> >
> >
> >
> > 3. `PushIntervalMs` is set on the client side by a response from
> > `GetTelemetrySubscriptionsResponse`. If the broker sets this value to too
> > low, like 1msec, this may hog all of the clients activity and cause an
> > impact on the client side. I think we should introduce a configuration
> both
> > on the client and the broker side for the minimum and maximum numbers for
> > this value to fence out misconfigurations.
> >
> >
> >
> > 4. One of the important things I face during debugging the client side
> > failures is to understand the client side configurations. Can the client
> > sends these configs during the GetTelemetrySubscriptions request as well?
> >
> >
> >
> > Small comments:
> >
> > 5. Default PushIntervalMs is 5 minutes. Can we make it 1 minute instead?
> I
> > think 5 minutes of aggregated data is too not helpful in the world of
> > telemetry 😊
> >
> > 6. UnsupportedCompressionType: Shall we fallback to non-compression mode
> in
> > that case? I think compression is nice to have, but non-compressed
> > telemetry data is valuable as well. Especially for low throughput
> clients,
> > compressing telemetry data may cause more CPU load then the actual data
> > plane work.
> >
> >
> > Thanks again.
> >
> > Doguscan
> >
> >
> >
> >> On Jun 13, 2023, at 8:06 AM, Andrew Schofield
> >
> >> <andrew_schofield_j...@outlook.com> wrote:
> >
> >>
> >
> >> Hi,
> >
> >> I would like to start a new discussion thread on KIP-714: Client metrics
> > and
> >
> >> observability.
> >
> >>
> >
> >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> >
> >>
> >
> >> I have edited the proposal significantly to reduce the scope. The
> overall
> >
> >> mechanism for client metric subscriptions is unchanged, but the
> >
> >> KIP is now based on the existing client metrics, rather than introducing
> > new
> >
> >> metrics. The purpose remains helping cluster operators
> >
> >> investigate performance problems experienced by clients without
> requiring
> >
> >> changes to the client application code or configuration.
> >
> >>
> >
> >> Thanks,
> >
> >> Andrew
>
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Reply via email to