cmccabe commented on PR #16672: URL: https://github.com/apache/kafka/pull/16672#issuecomment-2251010757
> This is a pretty large PR, and it seems to accomplish two separate things: first, it enables the broker to check for topic existence before sending the directory assignment request; and second, it improves the efficiency of the AssignmentsManager by batching requests. Is that correct? I'm wondering if we can separate out these two aspects into two individual (and thus smaller) PRs. That would make them easier to review.

The previous code did do batching, but in a somewhat different way. The main difference is that it created one event queue event per request, a pretty inefficient pattern. In general I do not think it is feasible to separate this out into multiple PRs.

> With regard to the actual bug -- we can get stuck retrying infinitely for partitions that no longer exist -- I see that you added a dependency on the metadata module within the server module.

The `:server` module is intended as a replacement for `:core`, but written in Java. It's not intended to have minimal dependencies. For example, it already depends on `:server-common`, `:storage`, `:group-coordinator`, `:transaction-coordinator`, and `:raft`. So there isn't any reason to avoid this dependency.

> Could we avoid this if we just allowed the broker to accept a response from the controller indicating that the topic no longer exists? After all, the controller is the book of record, so why do we need to check locally? This case of getting stuck due to a topic being rapidly created and then deleted is expected to be unusual. We've added a bunch of implementation on the broker side to deal with the possibility, but given the rarity of the problem I am wondering if that additional broker-side implementation is worth it when the controller can just tell us the truth and we can believe it.

It is inefficient to send a request that we know is futile (i.e. our metadata image already shows the topic gone). This is a pretty common case, actually, for short-lived topics!
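To make the local check concrete, here is a minimal sketch of the idea, not the code in this PR. Every name in it (`AssignmentFilterSketch`, `PendingAssignment`, `buildBatch`, plain string topic IDs, and a `Set` standing in for the metadata image) is hypothetical and chosen only for illustration. It drops pending assignments whose topic is already gone from the local snapshot and groups the rest into one bounded batch, rather than one event per request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names come from the Kafka code base.
final class AssignmentFilterSketch {

    /** One pending directory assignment for a partition (illustrative only). */
    static final class PendingAssignment {
        final String topicId;
        final int partition;
        final String directoryId;

        PendingAssignment(String topicId, int partition, String directoryId) {
            this.topicId = topicId;
            this.partition = partition;
            this.directoryId = directoryId;
        }
    }

    /**
     * Builds a single batch of at most maxBatchSize assignments, skipping any
     * whose topic is absent from liveTopicIds (a stand-in for the local
     * metadata image). Skipped entries are never sent to the controller.
     */
    static List<PendingAssignment> buildBatch(List<PendingAssignment> pending,
                                              Set<String> liveTopicIds,
                                              int maxBatchSize) {
        List<PendingAssignment> batch = new ArrayList<>();
        for (PendingAssignment assignment : pending) {
            if (!liveTopicIds.contains(assignment.topicId)) {
                continue; // topic already deleted locally, so a request would be futile
            }
            if (batch.size() >= maxBatchSize) {
                break; // batch is full; the remainder waits for the next round
            }
            batch.add(assignment);
        }
        return batch;
    }
}
```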
I have some clusters with "health check" code that creates and deletes a topic periodically, and they trigger this case every single time. It is not rare or unusual. Similarly, it's pretty common for partitions to be reassigned.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
