[ 
https://issues.apache.org/jira/browse/KAFKA-8940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955581#comment-16955581
 ] 

ASF GitHub Bot commented on KAFKA-8940:
---------------------------------------

guozhangwang commented on pull request #7565: KAFKA-8940: Tighten up 
SmokeTestDriver
URL: https://github.com/apache/kafka/pull/7565
 
 
   After many runs trying to reproduce the failure (on my local MP5 it takes 
about 100 - 200 runs to get one), I think it is more likely a flaky test than a 
real bug in the rebalance protocol.
   
   What I've observed is that when the verifying consumer fetches from the 
output topics (there are 11 of them), it calls `poll(1sec)` each time, and 
stops after 30 retries that return no more data. This means that if no data is 
fetched from the output topics for 30 * 1 = 30 seconds, the verification stops 
(potentially too early). In the failure cases, we observe continual rebalancing 
among the closing / newly created clients because closing is async, i.e. while 
new clients are being added, rebalances triggered by the closing clients may 
not have completed yet (note that each instance is configured with 3 threads, 
and in the worst case there are 6 instances running / pending shutdown at the 
same time, so a group of 3 * 6 = 18 members is possible).
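
To make the timing concrete, here is a minimal sketch of the idle-timeout loop described above. It is not the actual SmokeTestDriver code: `IdlePollSketch`, `drain`, and `maxIdle` are hypothetical names, the `poll` function is a stand-in for the consumer's poll call, and resetting the counter whenever data arrives is an assumption.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class IdlePollSketch {
    /**
     * Keeps polling until maxEmptyPolls consecutive polls return nothing.
     * `poll` is a hypothetical stand-in for consumer.poll(timeout).
     */
    static <T> List<T> drain(Function<Duration, List<T>> poll,
                             Duration pollTimeout,
                             int maxEmptyPolls) {
        List<T> received = new ArrayList<>();
        int emptyPolls = 0;
        while (emptyPolls < maxEmptyPolls) {
            List<T> batch = poll.apply(pollTimeout);
            if (batch.isEmpty()) {
                emptyPolls++;            // one more poll without data
            } else {
                emptyPolls = 0;          // assumption: any data resets the counter
                received.addAll(batch);
            }
        }
        return received;
    }

    /** Longest gap without data that the loop tolerates before giving up. */
    static Duration maxIdle(Duration pollTimeout, int maxEmptyPolls) {
        return pollTimeout.multipliedBy(maxEmptyPolls);
    }
}
```

With a 1-second poll and 30 retries, `maxIdle` gives 30 seconds, so any stretch of rebalancing that stalls output for longer than that would end verification before all records arrive.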
   
   However, there's still a possible bug in KIP-429 where the rebalance somehow 
never stabilizes and members keep rejoining, causing the test to fail. 
@ableegoldman has bumped another unit test up to 3 rebalances, and if that 
fails again it may be better confirmation that such a bug exists.
   
   So what I've done so far for this test:
   
   1. When closing a client, wait for the close to complete before moving on to 
the next iteration and starting a new instance, to reduce rebalance churn.
   
   2. Poll for 5 seconds instead of 1 to wait longer: 5 * 30 = 150 seconds of 
idle time. Locally, my laptop finished this test in about 50 seconds.
   
   3. Minor debug logging improvements; in fact some of them reduce redundant 
debug logging, since the output is too long and sometimes hides the key 
information.
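
The first change in the list can be illustrated with a small stand-alone sketch. It is only an approximation of the idea, not the code in the PR: `BlockingCloseSketch`, `Client`, `closeAsync`, and `closeAndWait` are hypothetical names, and a sleeping background thread stands in for a Streams instance shutting down.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class BlockingCloseSketch {
    /** Hypothetical stand-in for a client whose shutdown runs on a background thread. */
    static final class Client {
        private final CountDownLatch closed = new CountDownLatch(1);
        private final long shutdownMillis;

        Client(long shutdownMillis) {
            this.shutdownMillis = shutdownMillis;
        }

        /** Starts shutdown and returns immediately, without waiting for it to finish. */
        void closeAsync() {
            Thread worker = new Thread(() -> {
                try {
                    Thread.sleep(shutdownMillis);   // simulated shutdown work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                closed.countDown();                 // signal that shutdown has ended
            });
            worker.setDaemon(true);
            worker.start();
        }

        /** Starts shutdown and blocks until it completes; false on timeout or interrupt. */
        boolean closeAndWait(long timeoutMillis) {
            closeAsync();
            try {
                return closed.await(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }

        boolean isClosed() {
            return closed.getCount() == 0;
        }
    }
}
```

A test loop that calls `closeAndWait` before starting the next instance never has a closing client and a new client in the group at the same time, which is the source of the overlapping rebalances described earlier.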
   
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Flaky Test SmokeTestDriverIntegrationTest.shouldWorkWithRebalance
> -----------------------------------------------------------------
>
>                 Key: KAFKA-8940
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8940
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams, unit tests
>            Reporter: Guozhang Wang
>            Assignee: John Roesler
>            Priority: Major
>              Labels: flaky-test
>
> I lost the screenshot unfortunately... it reports that the set of expected 
> records does not match the received records.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
