C0urante opened a new pull request, #16585:
URL: https://github.com/apache/kafka/pull/16585

   This test case has become flaky since 
https://github.com/apache/kafka/pull/16477 was merged.
   
   As a quick fix, we can give workers time to discover that the Kafka cluster 
has gone down, so that the health check endpoint reports an accurate status.
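In test terms, the quick fix means polling until the worker reports the expected status instead of asserting on the first response. The following is a minimal, standalone sketch of that pattern. It assumes a hypothetical health check modeled as a `BooleanSupplier`. `WaitUntilSketch`, `waitUntil`, and the simulated check are illustrative names only. They are not part of Kafka's test utilities or of this PR's diff.

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class WaitUntilSketch {

    /**
     * Hypothetical helper: polls the condition until it returns true or the
     * timeout elapses. Returns true if the condition was observed to hold,
     * false if the deadline passed first.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            // Back off between polls, but never sleep past the deadline.
            TimeUnit.NANOSECONDS.sleep(Math.min(pollInterval.toNanos(), remainingNanos));
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical stand-in for the health check endpoint: it only starts
        // reporting "unhealthy" on the third poll, mimicking a worker that needs
        // a little time to notice that the Kafka cluster has gone down.
        AtomicInteger polls = new AtomicInteger();
        BooleanSupplier reportsUnhealthy = () -> polls.incrementAndGet() >= 3;

        boolean observed = waitUntil(reportsUnhealthy, Duration.ofSeconds(5), Duration.ofMillis(10));
        if (!observed) {
            throw new AssertionError("Health check never reported the Kafka cluster as down");
        }
        System.out.println("Unhealthy status observed after " + polls.get() + " polls");
    }
}
```

Bounding the wait with a deadline keeps a genuine regression visible as a timeout failure rather than a hung test. A fixed sleep would be simpler but either wastes time or remains flaky.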
   
   Ideally, this would not be necessary: we would always detect that the Kafka 
cluster is down at the top of our tick loop 
[here](https://github.com/apache/kafka/blob/0ada8fac6869cad8ac33a79032cf5d57bfa2a3ea/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L419).
However, with some rare-but-not-impossible timing, that method may return 
without sending any requests to the group coordinator. This is probably a bug 
in its own right, and if reviewers agree, we can file a Jira ticket for it. 
That said, this possibly buggy behavior is not newly introduced; our tests fail 
simply because newly introduced logic does not account for it. We therefore 
should not block this test fix on a fix for that behavior.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]