Re: [DISCUSS] KIP-714: Client metrics and observability

Feng Min Mon, 20 Sep 2021 17:35:48 -0700

Thanks Magnus & Colin for the discussion.

Based on KIP-714's stateless design, Client can pretty much use any
connection to any broker to send metrics. We are not associating connection
with client metric state. Is my understanding correct? If yes,  how about
the following two scenarios


1) One Client (Client-ID) registers two different client instance id via
separate registration. Is it permitted? If OK, how to distinguish them from
the case 2 below.

2) How about the client restarting? What's the expectation? Should the
server expect the client to carry a persisted client instance id or should
the client be treated as a new instance?

also some comments inline.

On Mon, Sep 20, 2021 at 11:41 AM Colin McCabe <[email protected]> wrote:

> On Tue, Sep 14, 2021, at 00:47, Magnus Edenhill wrote:
> > Thanks for your feedback Colin, see my updated proposal below.
> > ...
>
> Hi Magnus,
>
> Thanks for the update.
>
> >
> > Splitting up the API into separate data and control requests makes sense.
> > With a split we would have one API for querying the broker for configured
> > metrics subscriptions,
> > and one API for pushing the collected metrics to the broker.
> >
> > A mechanism is still needed to notify the client when the subscription is
> > changed;
> > I’ve added a SubscriptionId for this purpose (which could be a checksum
> of
> > the configured metrics subscription), this id is sent to the client along
> > with the metrics subscription, and the client sends it back to the broker
> > when pushing metrics. If the broker finds the pushed subscription id to
> > differ from what is expected it will return an error to the client, which
> > triggers the client to retrieve the new subscribed metrics and an updated
> > subscription id. The generation of the subscriptionId is opaque to the
> > client.
> >
>
> Hmm, SubscriptionId seems rather complex. We don't have this kind of
> complicated machinery for changing ApiVersions, and that is something that
> can also change over time, and which affects the clients.
>
> Changing the configured metrics should be extremely rare. In this case,
> why don't we just close all connections on the broker side? Then the
> clients can re-connect and re-fetch the information about the metrics
> they're supposed to send.
>
> >
> > Something like this:
> >
> > // Get the configured metrics subscription.
> > GetTelemetrySubscriptionsRequest {
> >    StrNull  ClientInstanceId  // Null on first invocation to retrieve a
> > newly generated instance id from the broker.
> > }
>
> +1 on RegisterClient or RegisterMetricClient


> It seems like the goal here is to have the client register itself, so that
> we can tell if this is an old client reconnecting. If that is the case,
> then I suggest to rename the RPC to RegisterClient.
>
> I think we need a better name than "clientInstanceId" since that name is
> very similar to "clientId." Perhaps something like originId? Or clientUuid?
> Let's also use UUID here rather than a string.
>
> > 6. > PushTelemetryRequest{
> >        ClientInstanceId = f00d-feed-deff-ceff-ffff-….,
> >        SubscriptionId = 0x234adf34,
> >        ContentType = OTLPv8|ZSTD,
> >        Terminating = False,
> >        Metrics = …// zstd-compressed OTLPv08-protobuf-serialized metrics
> >   }
>
>
If we assume connection is not bound to ClientInstanceId, and the RPC can
be sent to any broker (not necessarily the broker doing the registration).
The client-instance-id is required for every metric reporting. It's just
part of the labelling.


> It's not necessary for the client to re-send its client instance ID here,
> since it already registered with RegisterClient. If the TCP connection
> dropped, it will have to re-send RegisterClient anyway. SubscriptionID we
> should get rid of, as I said above.
>
> I don't see the need for protobufs. Why not just use Kafka's own
> serialization mechanism? As much as possible, we should try to avoid
> creating "turduckens" of protocol X containing a buffer serialized with
> protocol Y, containing a protocol serialized with protocol Z. These aren't
> conducive to a good implementation, and make it harder for people to write
> clients. Just use Kafka's RPC protocol (with optional fields if you wish).
>
> If we do compression on Kafka RPC, I would prefer that we do it a more
> generic way that applies to all control messages, not just this one. I also
> doubt we need to support lots and lots of different compression codecs, at
> first at least.
>
> Another thing I'd like to understand is whether we truly need
> "terminating" (or anything like it). I'm still confused about how the
> backend could use this. Keep in mind that we may receive it on multiple
> brokers (or not receive it at all). We may receive more stuff about client
> XYZ from broker 1 after we have already received a "terminated" for client
> XYZ from broker 2.
>
> > If the broker connection goes down or the connection is to be used for
> > other purposes (e.g., blocking FetchRequests), the client will send
> > PushTelemetryRequests to any other broker in the cluster, using the same
> > ClientInstanceId and SubscriptionId as received in the latest
> > GetTelemetrySubscriptionsResponse.
> >
> > While the subscriptionId may change during the lifetime of the client
> > instance (when metric subscriptions are updated), the ClientInstanceId is
> > only acquired once and must not change (as it is used to identify the
> > unique client instance).
> > ...
> > What we do want though is ability to single out a specific client
> instance
> > to give it a more fine-grained subscription for troubleshooting, and
> > we can do that with the current proposal with matching solely on the
> > CLIENT_INSTANCE_ID.
> > In other words; all clients will have the same standard metrics
> > subscription, but specific client instances can have alternate
> > subscriptions.
>
> That makes sense, and gives a good reason why we might want to couple
> finding the metrics info to passing the client UUID.
>
> > The metrics collector/tsdb/whatever will need to identify a single client
> > instance, regardless of which broker received the metrics.
> > The chapter on CLIENT_INSTANCE_ID motivates why we need a unique
> > identifier, basically because neither clientID, principal or remote
> > address:port, etc, can be
> > used to identify a single client instance.
> >
>
> Thanks for the background. I agree that a UUID is useful here.
>
> This also gives additional reasons why using a UUID rather than a
> free-form string is desirable (many databases can handle numbers more
> efficiently than strings).
>
> > > Unfortunately, we have to strictly control the metrics format, because
> > > otherwise clients can't implement it. I agree that we don't need to
> specify
> > > how the broker-side code works, since that is pluggable. It's also
> > > reasonable for the clients to have pluggable extensions as well, but
> this
> > > KIP won't be of much use if we don't at least define a basic set of
> metrics
> > > that most clients can understand how to send. The open source clients
> will
> > > not implement anything more than what is specified in the KIP (or at
> least
> > > the AK one won't...)
> > >
> >
> > Makes sense, in the updated proposal above I changed ContentType to a
> > bitmask.
> >
>
> Hmm. You might have forgotten to save your changes. It's still listed as a
> string in the KIP.
>
> > > I'm not sure if OpenCensus adds any value to this KIP, to be honest.
> Their
> > > primary focus was never on the format of the data being sent (in fact,
> the
> > > last time they checked, they left the format up to each OpenCensus
> > > implementation). That may have changed, but I think it still has
> limited
> > > usefulness to us, since we have our own format which we have to use
> anyway.
> > >
> >
> > Oh, I meant concensus as in kafka-dev agreement :)
> >
> > Feng is looking into the implementation details of the Java client and
> will
> > update the KIP with regards to dependencies.
>
> I still don't understand what value using the OpenTelemetry format is
> giving us here, as opposed to using Kafka's own format. I guess we will
> need to talk more about this.
>
> > > Hmm, that data is about 10 fields, most of which are strings. It
> certainly
> > > adds a lot of overhead to resend it each time.
> > >
> > > I don't follow the comment about unpacking and repacking -- since the
> > > client registered with the broker it already knows all this
> information, so
> > > there's nothing to unpack or repack, except from memory. If it's more
> > > convenient to serialize it once rather than multiple times, that is an
> > > implementation detail of the broker side plugin, which we are not
> > > specifying here anyway.
> > >
> >
> > The current proposal is pretty much stateless on the broker, it does not
> > need to hold any state for a client (instance), and no state
> > synchronization is needed
> > between brokers in the cluster, which allows a client to seamlessly send
> > metrics to any broker it wants and keeps the API overhead down (no need
> to
> > re-register when
> > switching brokers for instance).
> >
> > We could remove the labels that are already available to the broker on a
> > per-request basis or that it already maintains state for:
> >  - client_id
> >  - client_instance_id
> >  - client_software_*
> >
> > Leaving the following to still be included:
> >  - group_id
> >  - group_instance_id
> >  - transactional_id
> >   etc..
> >
> > What do you think of that?
>
> Yes, that might be a reasonable balance. I don't think we currently
> associate group_id with a connection, so it would be reasonable to re-send
> it in the telemetry request.
>
> What I would suggest next is to set up a protocol definition for the
> metrics you are sending over the wire. For example, you could specify
> something like this for producer metrics:
>
> > { "name": "clientProducerRecordBytes", "type": "int64",
> > "versions": "0+", "taggedVersions": "0+", "tag": 0,
> >  "about": "client.producer.record.bytes" },
> > { "name": "clientProducerQueueMaxMessages", "type": "int32",
> > "versions": "0+", "taggedVersions": "0+", "tag": 1,
> > "about": "client.producer.queue.max.messages" },
>
> This specifies what it is (int32, int64, etc.), how it's being sent, etc.
>
> best,
> Colin
>
> >
> >
> > Thanks,
> > Magnus
> >
> >
> >
> > >
> > > best,
> > > Colin
> > >
> > > > Thanks,
> > > > Magnus
> > > >
> > > > >
> > > > >
> > > >
> > >
> >
>


-- 
Best,
Feng

Re: [DISCUSS] KIP-714: Client metrics and observability

Reply via email to