Some comments about subscriptionId.

On Mon, Sep 20, 2021 at 11:41 AM Colin McCabe <cmcc...@apache.org> wrote:
> On Tue, Sep 14, 2021, at 00:47, Magnus Edenhill wrote:
> > Thanks for your feedback Colin, see my updated proposal below.
> > ...
>
> Hi Magnus,
>
> Thanks for the update.
>
> >
> > Splitting up the API into separate data and control requests makes sense.
> > With a split we would have one API for querying the broker for the
> > configured metrics subscriptions, and one API for pushing the collected
> > metrics to the broker.
> >
> > A mechanism is still needed to notify the client when the subscription is
> > changed; I’ve added a SubscriptionId for this purpose (which could be a
> > checksum of the configured metrics subscription). This id is sent to the
> > client along with the metrics subscription, and the client sends it back
> > to the broker when pushing metrics. If the broker finds the pushed
> > subscription id to differ from what is expected, it will return an error
> > to the client, which triggers the client to retrieve the new subscribed
> > metrics and an updated subscription id. The generation of the
> > subscriptionId is opaque to the client.
> >

ApiVersion is not a good example: the ApiVersion is itself effectively acting
as an identifier, since the client carries this information. Forcing a
disconnect from the server side is quite heavy-handed. IMHO, this behavior is
really part of the protocol. Adding a subscriptionId is relatively simple and
straightforward.

> Hmm, SubscriptionId seems rather complex. We don't have this kind of
> complicated machinery for changing ApiVersions, and that is something that
> can also change over time, and which affects the clients.
>
> Changing the configured metrics should be extremely rare. In this case,
> why don't we just close all connections on the broker side? Then the
> clients can re-connect and re-fetch the information about the metrics
> they're supposed to send.
>
> >
> > Something like this:
> >
> > // Get the configured metrics subscription.
> > GetTelemetrySubscriptionsRequest {
> >    StrNull ClientInstanceId  // Null on first invocation to retrieve a
> >                              // newly generated instance id from the broker.
> > }
>
> It seems like the goal here is to have the client register itself, so that
> we can tell if this is an old client reconnecting. If that is the case,
> then I suggest renaming the RPC to RegisterClient.
>
> I think we need a better name than "clientInstanceId" since that name is
> very similar to "clientId." Perhaps something like originId? Or clientUuid?
> Let's also use UUID here rather than a string.
>
> > 6.
> > PushTelemetryRequest {
> >    ClientInstanceId = f00d-feed-deff-ceff-ffff-….,
> >    SubscriptionId = 0x234adf34,
> >    ContentType = OTLPv8|ZSTD,
> >    Terminating = False,
> >    Metrics = …  // zstd-compressed OTLPv08-protobuf-serialized metrics
> > }
>
> It's not necessary for the client to re-send its client instance ID here,
> since it already registered with RegisterClient. If the TCP connection
> dropped, it will have to re-send RegisterClient anyway. SubscriptionId we
> should get rid of, as I said above.
>
> I don't see the need for protobufs. Why not just use Kafka's own
> serialization mechanism? As much as possible, we should try to avoid
> creating "turduckens" of protocol X containing a buffer serialized with
> protocol Y, containing a protocol serialized with protocol Z. These aren't
> conducive to a good implementation, and make it harder for people to write
> clients. Just use Kafka's RPC protocol (with optional fields if you wish).
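To make the subscription flow concrete, here is a rough client-side sketch of
what Magnus describes above. All type, method and error names below are
illustrative only (not the KIP's API), and the sketch is independent of the
protobuf-vs-Kafka-serialization question:

  // Illustrative sketch of the client-side telemetry loop; none of these
  // types exist in the Kafka clients today.
  import java.util.List;
  import java.util.UUID;

  public class TelemetryReporterSketch {

      // What a GetTelemetrySubscriptions (or RegisterClient) response would carry.
      static class Subscriptions {
          UUID clientInstanceId;     // assigned by the broker on the first request
          int subscriptionId;        // e.g. a checksum of the subscribed metrics
          List<String> metricNames;  // the metrics this client should collect
          int pushIntervalMs;
      }

      enum PushResult { OK, UNKNOWN_SUBSCRIPTION_ID }  // error name is made up

      interface BrokerConnection {
          Subscriptions getTelemetrySubscriptions(UUID clientInstanceId);
          PushResult pushTelemetry(UUID clientInstanceId, int subscriptionId,
                                   byte[] metrics);
      }

      private Subscriptions current;

      void run(BrokerConnection broker) throws InterruptedException {
          // First call passes a null instance id; the broker generates one.
          current = broker.getTelemetrySubscriptions(null);
          while (true) {
              byte[] payload = collectAndSerialize(current.metricNames);
              PushResult result = broker.pushTelemetry(
                      current.clientInstanceId, current.subscriptionId, payload);
              if (result == PushResult.UNKNOWN_SUBSCRIPTION_ID) {
                  // The broker-side subscription changed: re-fetch the metric
                  // list and the new subscription id, keeping the instance id.
                  current = broker.getTelemetrySubscriptions(current.clientInstanceId);
              }
              Thread.sleep(current.pushIntervalMs);
          }
      }

      private byte[] collectAndSerialize(List<String> metricNames) {
          return new byte[0];  // placeholder for metrics collection + serialization
      }
  }

Whether the instance id comes from a separate RegisterClient RPC, as Colin
suggests, or from the subscriptions request itself doesn't change this loop
much.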
>
> If we do compression on Kafka RPC, I would prefer that we do it in a more
> generic way that applies to all control messages, not just this one. I also
> doubt we need to support lots and lots of different compression codecs, at
> first at least.
>
> Another thing I'd like to understand is whether we truly need "terminating"
> (or anything like it). I'm still confused about how the backend could use
> this. Keep in mind that we may receive it on multiple brokers (or not
> receive it at all). We may receive more stuff about client XYZ from broker 1
> after we have already received a "terminated" for client XYZ from broker 2.
>
> > If the broker connection goes down or the connection is to be used for
> > other purposes (e.g., blocking FetchRequests), the client will send
> > PushTelemetryRequests to any other broker in the cluster, using the same
> > ClientInstanceId and SubscriptionId as received in the latest
> > GetTelemetrySubscriptionsResponse.
> >
> > While the subscriptionId may change during the lifetime of the client
> > instance (when metric subscriptions are updated), the ClientInstanceId is
> > only acquired once and must not change (as it is used to identify the
> > unique client instance).
> > ...
> > What we do want though is the ability to single out a specific client
> > instance to give it a more fine-grained subscription for troubleshooting,
> > and we can do that with the current proposal by matching solely on the
> > CLIENT_INSTANCE_ID.
> > In other words: all clients will have the same standard metrics
> > subscription, but specific client instances can have alternate
> > subscriptions.
>
> That makes sense, and gives a good reason why we might want to couple
> finding the metrics info to passing the client UUID.
>
> > The metrics collector/tsdb/whatever will need to identify a single client
> > instance, regardless of which broker received the metrics.
> > The chapter on CLIENT_INSTANCE_ID motivates why we need a unique
> > identifier, basically because neither clientId, principal, nor remote
> > address:port, etc., can be used to identify a single client instance.
> >
>
> Thanks for the background. I agree that a UUID is useful here.
>
> This also gives additional reasons why using a UUID rather than a free-form
> string is desirable (many databases can handle numbers more efficiently
> than strings).
>
> > > Unfortunately, we have to strictly control the metrics format, because
> > > otherwise clients can't implement it. I agree that we don't need to
> > > specify how the broker-side code works, since that is pluggable. It's
> > > also reasonable for the clients to have pluggable extensions as well,
> > > but this KIP won't be of much use if we don't at least define a basic
> > > set of metrics that most clients can understand how to send. The open
> > > source clients will not implement anything more than what is specified
> > > in the KIP (or at least the AK one won't...)
> > >
> >
> > Makes sense, in the updated proposal above I changed ContentType to a
> > bitmask.
> >
>
> Hmm. You might have forgotten to save your changes. It's still listed as a
> string in the KIP.
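On the ContentType point: the bitmask form could be as simple as the sketch
below (bit assignments and names are purely illustrative, not what the KIP
specifies), letting a single field carry both the serialization format and
the compression codec, e.g. OTLPv8 | ZSTD from the example above:

  // Illustrative only: encode payload format and compression in one bitmask.
  public final class ContentType {
      public static final int OTLP_V08  = 1;       // bit 0: OTLP v0.8 payload
      public static final int KAFKA_RPC = 1 << 1;  // bit 1: Kafka-native serialization
      public static final int ZSTD      = 1 << 4;  // bit 4: zstd-compressed
      public static final int LZ4       = 1 << 5;  // bit 5: lz4-compressed

      private ContentType() {}

      public static boolean isZstd(int contentType) {
          return (contentType & ZSTD) != 0;
      }
  }

  // Usage: int contentType = ContentType.OTLP_V08 | ContentType.ZSTD;

Whether compression really belongs in this field, or is handled generically
for all RPCs as Colin prefers, is a separate question.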
> > > I'm not sure if OpenCensus adds any value to this KIP, to be honest.
> > > Their primary focus was never on the format of the data being sent (in
> > > fact, the last time I checked, they left the format up to each
> > > OpenCensus implementation). That may have changed, but I think it still
> > > has limited usefulness to us, since we have our own format which we
> > > have to use anyway.
> > >
> >
> > Oh, I meant consensus as in kafka-dev agreement :)
> >
> > Feng is looking into the implementation details of the Java client and
> > will update the KIP with regards to dependencies.
>
> I still don't understand what value using the OpenTelemetry format is
> giving us here, as opposed to using Kafka's own format. I guess we will
> need to talk more about this.
>
> > > Hmm, that data is about 10 fields, most of which are strings. It
> > > certainly adds a lot of overhead to resend it each time.
> > >
> > > I don't follow the comment about unpacking and repacking -- since the
> > > client registered with the broker it already knows all this
> > > information, so there's nothing to unpack or repack, except from
> > > memory. If it's more convenient to serialize it once rather than
> > > multiple times, that is an implementation detail of the broker-side
> > > plugin, which we are not specifying here anyway.
> > >
> >
> > The current proposal is pretty much stateless on the broker: it does not
> > need to hold any state for a client (instance), and no state
> > synchronization is needed between brokers in the cluster. This allows a
> > client to seamlessly send metrics to any broker it wants and keeps the
> > API overhead down (no need to re-register when switching brokers, for
> > instance).
> >
> > We could remove the labels that are already available to the broker on a
> > per-request basis or that it already maintains state for:
> > - client_id
> > - client_instance_id
> > - client_software_*
> >
> > Leaving the following to still be included:
> > - group_id
> > - group_instance_id
> > - transactional_id
> > etc.
> >
> > What do you think of that?
>
> Yes, that might be a reasonable balance. I don't think we currently
> associate group_id with a connection, so it would be reasonable to re-send
> it in the telemetry request.
>
> What I would suggest next is to set up a protocol definition for the
> metrics you are sending over the wire. For example, you could specify
> something like this for producer metrics:
>
> > { "name": "clientProducerRecordBytes", "type": "int64",
> >   "versions": "0+", "taggedVersions": "0+", "tag": 0,
> >   "about": "client.producer.record.bytes" },
> > { "name": "clientProducerQueueMaxMessages", "type": "int32",
> >   "versions": "0+", "taggedVersions": "0+", "tag": 1,
> >   "about": "client.producer.queue.max.messages" },
>
> This specifies what it is (int32, int64, etc.), how it's being sent, etc.
>
> best,
> Colin
>
> > Thanks,
> > Magnus
> >
> > > best,
> > > Colin
> > >
> > > > Thanks,
> > > > Magnus

--
Best,
Feng
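P.S. On the label split above, a small sketch of what the client itself would
still attach to pushed metrics, versus what the broker can already derive from
the connection/registration (class and method names purely illustrative):

  import java.util.LinkedHashMap;
  import java.util.Map;

  // Illustrative only: client_id, client_instance_id and client_software_*
  // are omitted because the broker already knows them; only labels the broker
  // cannot derive on its own are attached by the client.
  public final class TelemetryLabels {
      public static Map<String, String> clientProvidedLabels(
              String groupId, String groupInstanceId, String transactionalId) {
          Map<String, String> labels = new LinkedHashMap<>();
          if (groupId != null) labels.put("group_id", groupId);
          if (groupInstanceId != null) labels.put("group_instance_id", groupInstanceId);
          if (transactionalId != null) labels.put("transactional_id", transactionalId);
          return labels;
      }

      private TelemetryLabels() {}
  }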