cmccabe commented on PR #16672:
URL: https://github.com/apache/kafka/pull/16672#issuecomment-2251010757

   > This is a pretty large PR, and it seems to accomplish two separate things: 
first, it enables the broker to check for topic existence before sending the 
directory assignment request; and second, it improves the efficiency of the 
AssignmentsManager by batching requests. Is that correct? I'm wondering if we 
can separate out these two aspects into two individual (and thus smaller) PRs. 
That would make them easier to review.
   
   The previous code did do batching, but in a somewhat different way. The main 
difference is that it created one event queue event per request, which is a 
pretty inefficient pattern. In general I do not think it is feasible to 
separate this out into multiple PRs.
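   To illustrate the difference: instead of enqueuing one event queue event per 
assignment request, the pending assignments can be accumulated and drained by a 
single event, which coalesces everything queued so far into one batch. This is 
only a minimal sketch of that pattern, not the actual AssignmentsManager code; 
all class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: at most one drain event is in flight at a time,
// and that single event batches all queued assignments into one request.
public class AssignmentBatcher {
    // Illustrative stand-in for a directory assignment, not a Kafka class.
    record Assignment(String topic, int partition, String directory) {}

    private final ConcurrentLinkedQueue<Assignment> pending = new ConcurrentLinkedQueue<>();
    private boolean eventScheduled = false;  // true while a drain event is queued

    // Callers enqueue assignments. Only the first enqueue after a drain
    // returns true, telling the caller to schedule one drain event --
    // rather than one event per request.
    public synchronized boolean submit(Assignment a) {
        pending.add(a);
        if (eventScheduled) return false;  // a drain event is already queued
        eventScheduled = true;
        return true;
    }

    // The single drain event collects everything queued so far into one batch.
    public synchronized List<Assignment> drain() {
        List<Assignment> batch = new ArrayList<>();
        Assignment a;
        while ((a = pending.poll()) != null) batch.add(a);
        eventScheduled = false;
        return batch;
    }
}
```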
   
   > With regard to the actual bug -- we can get stuck retrying infinitely for 
partitions that no longer exist -- I see that you added a dependency on the 
metadata module within the server module.
   
   The `:server` module is intended as a replacement for `:core`, but written 
in Java. It's not intended to have minimal dependencies. For example, it 
already depends on `:server-common`, `:storage`, `:group-coordinator`, 
`:transaction-coordinator`, and `:raft`. So there isn't any reason to avoid 
this dependency.
   
   > Could we avoid this if we just allowed the broker to accept a response 
from the controller indicating that the topic no longer exists? After all, the 
controller is the book of record, so why do we need to check locally? This case 
of getting stuck due to a topic being rapidly created and then deleted is 
expected to be unusual. We've added a bunch of implementation on the broker 
side to deal with the possibility, but given the rarity of the problem I am 
wondering if that additional broker-side implementation is worth it when the 
controller can just tell us the truth and we can believe it.
   
   It is inefficient to send a request that we know is futile (i.e., one where 
our metadata image already shows the topic as gone). This is a pretty common case, 
actually, for short-lived topics! I have some clusters with "health check" code 
that creates and deletes a topic periodically, and they trigger this case every 
single time. It is not rare or unusual. Similarly, it's pretty common for 
partitions to be reassigned.
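   The broker-side check amounts to pruning the batch against the local metadata 
image before sending, so assignments for already-deleted topics never go out. A 
minimal sketch of that idea, with hypothetical names (the `Set<String>` stands in 
for a lookup against the broker's metadata image):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: drop assignments for topics that the broker's
// current metadata image already shows as deleted, so we never send a
// request we know is futile.
public class AssignmentFilter {
    // Illustrative stand-in for a directory assignment, not a Kafka class.
    record Assignment(String topic, int partition, String directory) {}

    // topicsInImage stands in for the set of topics present in the
    // broker's current metadata image.
    public static List<Assignment> prune(List<Assignment> batch, Set<String> topicsInImage) {
        return batch.stream()
                .filter(a -> topicsInImage.contains(a.topic()))  // keep only live topics
                .collect(Collectors.toList());
    }
}
```

With a periodically created-and-deleted "health check" topic, the pruned entries 
are exactly the ones that would otherwise retry forever.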


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
