[ https://issues.apache.org/jira/browse/KAFKA-8940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955581#comment-16955581 ]
ASF GitHub Bot commented on KAFKA-8940:
---------------------------------------

guozhangwang commented on pull request #7565: KAFKA-8940: Tighten up SmokeTestDriver
URL: https://github.com/apache/kafka/pull/7565

After many runs trying to reproduce the failure (on my local MP5 it takes about 100 - 200 runs to get one), I think it is more likely a flaky test than a real bug in the rebalance protocol.

What I've observed is that when the verifying consumer fetches from the output topics (there are 11 of them), it calls `poll(1sec)` each time, and if there is no more data to fetch it retries 30 times and then stops. This means that if no data is fetched from the output topics for 30 * 1 = 30 seconds, the verification stops (potentially too early). In the failure cases, we observe consistent rebalancing among the closing / newly created clients because the closing is async, i.e. while new clients are being added, the rebalances triggered by closing clients may not have completed yet (note that each instance is configured with 3 threads, and in the worst case there are 6 instances running / pending shutdown at the same time, so a group of 3 * 6 = 18 members is possible).

However, it is still possible that there is a bug in KIP-429 where the rebalance somehow never stabilizes and members keep re-joining, causing the test to fail. We have another unit test that @ableegoldman has bumped up to 3 rebalances; if that fails again, it would be better confirmation that such a bug may exist.

So what I've done so far for this test:

1. When closing a client, wait for it to complete closure before moving on to the next iteration and starting a new instance, to reduce rebalance churn.
2. Poll for 5 seconds instead of 1 to wait longer: 5 * 30 = 150 seconds. Locally, my laptop finished this test in about 50 seconds.
3. Minor debug logging improvements; in fact, some of them reduce redundant debug logging, since it is too long and sometimes hides the key information.

### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Flaky Test SmokeTestDriverIntegrationTest.shouldWorkWithRebalance
> -----------------------------------------------------------------
>
>                 Key: KAFKA-8940
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8940
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams, unit tests
>            Reporter: Guozhang Wang
>            Assignee: John Roesler
>            Priority: Major
>              Labels: flaky-test
>
> I lost the screen shot unfortunately... it reports the set of expected
> records does not match the received records.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
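
The timing argument in the pull request comment (30 empty polls of 1 second each give a 30 second idle budget, versus 150 seconds at 5 seconds per poll) can be illustrated with a small sketch. This is not the SmokeTestDriver code: `EmptyPollBudget`, `idleBudget`, `drain`, and `scripted` are hypothetical names, the `poll` function is a stand-in for a consumer's poll call, and the sketch assumes the empty-poll counter resets whenever data arrives.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;

public class EmptyPollBudget {

    // Longest silence tolerated before giving up: pollTimeout * maxEmptyPolls.
    public static Duration idleBudget(Duration pollTimeout, int maxEmptyPolls) {
        return pollTimeout.multipliedBy(maxEmptyPolls);
    }

    // Counts records until maxEmptyPolls consecutive polls return nothing.
    // Assumption: the counter resets on every non-empty poll.
    public static int drain(Function<Duration, List<String>> poll,
                            Duration pollTimeout, int maxEmptyPolls) {
        int received = 0;
        int emptyPolls = 0;
        while (emptyPolls < maxEmptyPolls) {
            List<String> batch = poll.apply(pollTimeout);
            if (batch.isEmpty()) {
                emptyPolls++;
            } else {
                emptyPolls = 0;
                received += batch.size();
            }
        }
        return received;
    }

    // Hypothetical scripted source: hands out the given batches in order,
    // then returns empty batches forever. It ignores the timeout.
    public static Function<Duration, List<String>> scripted(List<List<String>> batches) {
        Deque<List<String>> queue = new ArrayDeque<>(batches);
        return timeout -> queue.isEmpty() ? List.<String>of() : queue.poll();
    }

    public static void main(String[] args) {
        System.out.println(idleBudget(Duration.ofSeconds(1), 30).getSeconds()); // 30
        System.out.println(idleBudget(Duration.ofSeconds(5), 30).getSeconds()); // 150

        List<String> none = List.of();
        // Two records, a gap of two empty polls (e.g. a slow rebalance), then one more.
        List<List<String>> script = List.of(List.of("a", "b"), none, none, List.of("c"));

        // A budget of 2 empty polls gives up during the gap and misses "c".
        System.out.println(drain(scripted(script), Duration.ofSeconds(1), 2)); // 2
        // A budget of 3 empty polls survives the gap and sees all 3 records.
        System.out.println(drain(scripted(script), Duration.ofSeconds(1), 3)); // 3
    }
}
```

The scripted gap plays the role of a rebalance that outlasts the idle budget: the shorter budget ends verification with records still outstanding, which matches the "expected records does not match the received records" symptom only under the assumptions stated above.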