lianetm commented on code in PR #16686:
URL: https://github.com/apache/kafka/pull/16686#discussion_r1747345639


##########
clients/src/main/java/org/apache/kafka/clients/consumer/internals/AsyncKafkaConsumer.java:
##########
@@ -1302,6 +1304,42 @@ private void releaseAssignmentAndLeaveGroup(final Timer 
timer) {
         }
     }
 
+    /**
+     * The unsubscribe process requires a handful of back-and-forth trips 
between the application thread and
+     * the background thread:
+     *
+     * <ol>
+     *     <li>
+     *         Application thread: enqueue {@link UnsubscribeEvent}
+     *     </li>
+     *     <li>
+     *         Background thread: process {@link UnsubscribeEvent} and
+     *                            enqueue {@link 
ConsumerRebalanceListenerCallbackNeededEvent}
+     *     </li>
+     *     <li>
+     *         Application thread: process {@link 
ConsumerRebalanceListenerCallbackNeededEvent},
+     *                             invoke appropriate {@link 
ConsumerRebalanceListener} method, and
+     *                             enqueue {@link 
ConsumerRebalanceListenerCallbackCompletedEvent}
+     *     </li>
+     *     <li>
+     *         Background thread: process {@link 
ConsumerRebalanceListenerCallbackCompletedEvent} and
+     *                            enqueue {@link 
NetworkClientDelegate.UnsentRequest} to send the
+     *                            {@link ConsumerGroupHeartbeatRequest} to 
leave the consumer group
+     *     </li>
+     * </ol>
+     *
+     * In cases where the incoming {@link Timer timer} has very little 
remaining time, e.g. 0, it is impossible
+     * to perform the thread switches necessary to leave the group. Therefore, 
in cases where there isn't much
+     * of a timeout left, increase it slightly (presently 1000 ms.) to improve 
the chances for the consumer to
+     * properly leave the group.

Review Comment:
   very interesting, and good that you tried out the close(0) approach because 
it reveals a bigger gap we still have, regardless of the interrupt issue: close 
with low timeouts may still not send the leave. 
   
   This is definitely bigger than the interrupt, so what about we tackle this 
separately, to properly think about the best approach (this wait for 1000 is 
definitely one but seems brittle, what's the magic number there?). We could 
maybe simply have a way to GenerateLeaveRequestEvent that we can explicitly 
call and wait on when we know we didn't have time to Unsubscribe properly). We 
should also add an integration test like the one you have here but for 
close(0). 
   
   What about:
   1. unblock this PR, only for the interrupt (description includes the low 
timeouts too), with the approach you had of allowing a close with it's original 
timeout.
   2. new jira to review and fix close with low timeout that may not send leave 
group
   3. new jira blocked on #2, to review close behaviour on interrupt, and 
attempt to respect the interrupted status better, by not allowing the close to 
take up all the time it wants
    Thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to