C0urante commented on PR #16788: URL: https://github.com/apache/kafka/pull/16788#issuecomment-2272238495
> I think that's still possible, if we're the first leader and don't get a chance to write a key to the topic before executing a request from a caller.

As long as we're using the `sessioned` protocol, if there's no key in the topic and we receive a request, the request is guaranteed to fail, since it won't have a signature. I think this is only really possible if a cluster is upgraded from some other protocol to `sessioned`; otherwise, the leader will have a chance to write a new key to the config topic before doing anything else, like writing new connector configs. We could possibly add retry logic for followers to wait for a session key to be published before sending requests to the leader when the `sessioned` protocol is enabled, but unless I'm mistaken, this would require changes unrelated to this PR and can probably be considered out of scope.

> If the key is null, we could reject it with a 403 and not reveal that the cluster is still starting up, triggering the retry logic on the requesting worker.

Returning a 403 response instead of a 503 won't change the retry logic of the worker issuing the request. We still retry on (almost) every error when forwarding task configs (unless the connector has generated too many task configs, as of [KIP-1004](https://cwiki.apache.org/confluence/display/KAFKA/KIP-1004%3A+Enforce+tasks.max+property+in+Kafka+Connect)), and on no errors when requesting zombie fencings. The only difference is that we'd log something at [debug level](https://github.com/apache/kafka/blob/5596f9e1d561dfef46a83fbd7264e6745ca538cc/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L2190) instead of at [error level](https://github.com/apache/kafka/blob/5596f9e1d561dfef46a83fbd7264e6745ca538cc/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L2195).
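To make the status-code decision concrete, here's a minimal sketch, with entirely hypothetical names (this is not the actual Kafka Connect code): a null session key yields a 503 so that startup is visible in the logs, while a present key with a bad signature yields a 403.

```java
// Hypothetical sketch, not the real Kafka Connect implementation: how an
// internal REST request could be answered under the sessioned protocol.
public class SessionKeyGate {

    public static final int OK = 200;
    public static final int FORBIDDEN = 403;
    public static final int SERVICE_UNAVAILABLE = 503;

    /**
     * @param sessionKey     the key read from the config topic, or null if none
     *                       has been read yet (e.g. the cluster is starting up)
     * @param signatureValid whether the request's signature verifies against the key
     */
    public static int dispositionFor(byte[] sessionKey, boolean signatureValid) {
        if (sessionKey == null) {
            // No key yet: answer 503 so the startup condition is visible,
            // rather than a 403 that hides it behind a generic auth failure.
            return SERVICE_UNAVAILABLE;
        }
        return signatureValid ? OK : FORBIDDEN;
    }

    public static void main(String[] args) {
        System.out.println(dispositionFor(null, false));         // 503
        System.out.println(dispositionFor(new byte[16], true));  // 200
        System.out.println(dispositionFor(new byte[16], false)); // 403
    }
}
```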
Unless there's a serious security concern (which we should probably address via a separate channel), I'd prefer not to revert the 503 logic, since it's a useful signal for people reading worker logs (including me... it really helped when debugging these test failures!), and we might even leverage it for retry logic in the future.

> We could also move the key verification to the herder thread, so that requests made during startup are delayed until the herder has finished starting, and we're sure that we've tried to install a key if one does not already exist.

This might be less secure, though, as it would allow an unauthenticated caller to saturate the herder tick thread. Given the small security issue this presents, and unless there's still a correctness problem I'm missing, I don't think this is a superior alternative. But if you have further concerns, I'd be happy to hear them.
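To illustrate why the 403-vs-503 choice doesn't change the requesting worker's behavior, here's a simplified sketch of the forwarding retry loop; the names are illustrative, not the actual `DistributedHerder` code, which differs mainly in which log level each status gets.

```java
import java.util.function.IntSupplier;

// Illustrative sketch of the retry behavior described above; not the actual
// DistributedHerder code. The follower retries a failed forward no matter
// which error status the leader returned.
public class ForwardRetrySketch {

    /** Retries the (hypothetical) forward until it succeeds or attempts run out. */
    public static boolean forwardWithRetries(IntSupplier sendRequest, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            int status = sendRequest.getAsInt();
            if (status < 400) {
                return true; // success: stop retrying
            }
            // 403, 503, or any other error: retry either way. In the real
            // herder, only the log level depends on the specific status.
        }
        return false; // gave up after maxAttempts
    }

    public static void main(String[] args) {
        // A leader that is "still starting up" answers 503 twice, then succeeds.
        int[] responses = {503, 503, 200};
        int[] next = {0};
        System.out.println(forwardWithRetries(() -> responses[next[0]++], 5)); // true
    }
}
```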
