Re: [PR] PROPOSAL: support async event execution in group coordinator [kafka]

via GitHub Thu, 09 Nov 2023 11:35:42 -0800


artemlivshits commented on code in PR #14705:
URL: https://github.com/apache/kafka/pull/14705#discussion_r1388499287



##########
group-coordinator/src/main/java/org/apache/kafka/coordinator/group/runtime/CoordinatorRuntime.java:
##########
@@ -659,9 +660,21 @@ public TopicPartition key() {
          */
         @Override
         public void run() {
+            try {
+                runAsync().get();

Review Comment:
   > if one client's log append is blocked for additional async check
   
   That is correct, it may become a perf problem, we can measure and see if 
it's worth fixing in practice, we'll have this choice (as well as the choice to 
postpone the fix, if we have time pressure to release).  But it won't be a 
functional problem.  Right now it is a functional problem, which is suboptimal 
in many ways:
   - appendRecords has async interface, thus adding async stages under such an 
interface can be done without inspection and understanding all callers (that's 
what an interface is -- any compliant implementation is valid), but doing so 
will break the current logic (so from the proper interface usage perspective it 
is a bug in the caller, which this proposal fixes)
   - we cannot release new transaction protocol (or new coordinator) without 
implementing new logic, which makes hard dependencies and pushes against 
timelines (now all of a sudden KIP-848 got a new work to do before release, 
just because there is some independent work is going on in transaction area)
   - KIP-890 part2 design is still under discussion, the verification protocol 
is likely to change, so any changes in KIP-890 protocol are going to have 
ripple effects on KIP-848
   - 2 fairly complex components are now tied together -- we cannot just 
innovate on transaction protocol implementation details (or to be broader -- on 
the whole IO subsystem implementation details -- e.g. Async IO) without 
understanding group coordinator implementation detail and we cannot innovate on 
group coordinator implementation detail without understanding implementation 
details of transaction protocol
   - to make the previous point worse, the dependency is not visible at the 
"point of use" -- someone tasked with improving transaction protocol (or IO in 
general) would have no indication from the appendRecords interface, that adding 
an async stage would need to have a corresponding change in group coordinator
   - the work needs to be duplicated in group coordinator (and the protocol is 
going slightly different for different client versions) which becomes a likely 
source of bugs
   
   IMO, the fact that transaction verification implementation just doesn't work 
out-of-box with the new group coordinator (and in fact requires quite 
non-trivial follow-up work that will block the release) is an architectural 
issue.  We should strive to make the system more decoupled, so that the context 
an engineer needs to understand to make local changes in a part of system is 
less.
   
   > Each new group coordinator thread handles requests from multiple groups 
and multiple clients within the same group.
   
   I don't think it's bound to a thread, but indeed the concurrency is limited 
to partition -- we don't let operations on the same partition run concurrently, 
so all the groups that are mapped to the same partition are contending.  This 
is, however, a specific implementation choice, it should be possible to make a 
group to be a unit of concurrency, and if that's not enough, we can let offset 
commits for different partitions go concurrently as well (they just need to 
make sure that group doesn't change, which is sort of a "read lock" on the 
group), at which point there probably wouldn't be any contention in the common 
path.
   
   Now, one might ask a question, implementing per-group synchronization adds 
complexity and handling transaction verification as an explicit state 
transition in group coordinator adds complexity, what the difference?  I'd say 
the difference is fundamental -- per-group synchronization complexity is 
encapsulated in one component and keeps the system decoupled: an engineer 
tasked to improve transaction protocol, doesn't need understand implementation 
details of group coordinator and vice versa.  Changes are smaller, can be made 
faster, and less bug prone.  Win-win-win.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] PROPOSAL: support async event execution in group coordinator [kafka]

Reply via email to