chickenchickenlove opened a new pull request, #21332:
URL: https://github.com/apache/kafka/pull/21332

   ### Description
   This PR addresses KAFKA-17397 by ensuring deterministic behavior in 
`ClassicKafkaConsumer.close()` under interruption for the classic group 
protocol.
   
   In CI, `PlaintextConsumerTest.testCloseLeavesGroupOnInterrupt()` can fail 
for `ClassicKafkaConsumer` because the `LeaveGroupRequest` is sometimes blocked 
by `NetworkClient.isReady()` when a metadata update is due. Because `isReady()` 
prioritizes metadata requests, the pending `LeaveGroupRequest` can remain 
unsent. If the calling thread is already interrupted, `ConsumerNetworkClient` 
may throw an `InterruptException` before the pending request gets a chance to 
be sent. The member then leaves the group only when `session.timeout.ms` 
expires, which makes the test flaky.
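
   The failure mode described above can be modeled roughly as follows. This is 
a simplified sketch, not Kafka's actual code: the class 
`InterruptedCloseSketch`, its `poll(boolean)` method, and 
`SketchInterruptException` are hypothetical stand-ins, and the interrupted 
state is passed in as a flag rather than read from the thread.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical, simplified model of the failure mode; not Kafka's real classes. */
final class InterruptedCloseSketch {
    /** Stand-in for Kafka's InterruptException (assumption for this sketch). */
    static final class SketchInterruptException extends RuntimeException {}

    private final Queue<String> unsent = new ArrayDeque<>();
    final List<String> sent = new ArrayList<>();
    private final boolean metadataUpdateDue;

    InterruptedCloseSketch(boolean metadataUpdateDue) {
        this.metadataUpdateDue = metadataUpdateDue;
    }

    void enqueue(String request) {
        unsent.add(request);
    }

    /** Models a readiness check that refuses all requests while a metadata update is due. */
    private boolean isReady() {
        return !metadataUpdateDue;
    }

    /** Tries to send pending requests, then surfaces the caller's interrupted state. */
    void poll(boolean callerInterrupted) {
        while (isReady() && !unsent.isEmpty()) {
            sent.add(unsent.poll());
        }
        if (callerInterrupted) {
            // Anything still queued at this point is never sent by this call.
            throw new SketchInterruptException();
        }
    }

    public static void main(String[] args) {
        InterruptedCloseSketch client = new InterruptedCloseSketch(true);
        client.enqueue("LEAVE_GROUP");
        try {
            client.poll(true);
        } catch (SketchInterruptException e) {
            System.out.println("interrupted, sent=" + client.sent);
        }
    }
}
```

   In this model, whether the request goes out depends only on whether a 
metadata update happens to be due when `close()` runs, which matches the 
intermittent nature of the failure.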
   
   ### Fix 
   Allow `LeaveGroupRequest` to bypass the metadata-update gating in 
`NetworkClient.isReady()` (while still respecting `canSendRequest`), so that 
the request is sent during `close()` even when a metadata update is due.
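
   A minimal sketch of the gating change, under stated assumptions: the class 
`LeaveGroupGatingSketch`, the `RequestKind` enum, and both method names are 
hypothetical and only illustrate the rule described here. The actual change in 
`NetworkClient` may be structured differently.

```java
/** Hypothetical, simplified model of the readiness gate; not the real NetworkClient. */
final class LeaveGroupGatingSketch {
    enum RequestKind { FETCH, HEARTBEAT, LEAVE_GROUP }

    /** Gate as described for trunk: a due metadata update blocks every request. */
    static boolean isReadyBefore(boolean canSendRequest, boolean metadataUpdateDue) {
        return !metadataUpdateDue && canSendRequest;
    }

    /** Gate as described for this PR: LEAVE_GROUP skips only the metadata check. */
    static boolean isReadyAfter(RequestKind kind, boolean canSendRequest, boolean metadataUpdateDue) {
        if (!canSendRequest) {
            return false; // connection state is still respected for every request
        }
        if (kind == RequestKind.LEAVE_GROUP) {
            return true; // bypass the metadata-update gating
        }
        return !metadataUpdateDue;
    }

    public static void main(String[] args) {
        System.out.println(isReadyBefore(true, true));                          // false
        System.out.println(isReadyAfter(RequestKind.LEAVE_GROUP, true, true));  // true
        System.out.println(isReadyAfter(RequestKind.FETCH, true, true));        // false
        System.out.println(isReadyAfter(RequestKind.LEAVE_GROUP, false, true)); // false
    }
}
```

   The bypass is scoped to one request kind so that ordinary traffic keeps 
yielding to metadata updates as before.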
   
   ### Sequence Diagram
   <img width="7840" height="5910" alt="seq-1" 
src="https://github.com/user-attachments/assets/1849b465-fc9e-4b9b-a76a-66aeab17188e"
 />
   - Steps 9, 10, and 11 show the non-deterministic behavior that prevents the 
`ClassicKafkaConsumer` from sending a `LEAVE_GROUP` request when it is 
interrupted.
   - This PR makes sending the `LEAVE_GROUP` request deterministic by 
bypassing the metadata update step specifically for `LEAVE_GROUP`. (See steps 
9 and 15 to 18.)
   
   ### Fixes
   - https://issues.apache.org/jira/browse/KAFKA-17397
   - https://issues.apache.org/jira/browse/KAFKA-18031
     - Flaky Test - 
https://develocity.apache.org/scans/tests?search.rootProjectNames=kafka&search.timeZoneId=Asia%2FTaipei&tests.container=kafka.api.PlaintextConsumerTest&tests.sortField=FLAKY&tests.test=testCloseLeavesGroupOnInterrupt(String)%5B1%5D
   
   ### Local reproduction
   - On the current trunk branch, the test failed 1 to 2 times out of 20 runs.
   - With this PR, all 100 test runs succeeded (although a few builds failed 
because the CPU was busy and memory was full).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

