showuon edited a comment on pull request #10409: URL: https://github.com/apache/kafka/pull/10409#issuecomment-816544910
@ableegoldman , I ran the test multiple times on Jenkins and printed some debug logs. Please check here: https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-9724/237/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegration10Test/Build___JDK_15_and_Scala_2_13___shouldInnerJoinMultiPartitionQueryable/

The error message `failed 1st time` appears because I don't stop the test immediately when the normal wait for consuming 1 record times out (60 seconds). Instead, I keep waiting for another 60 seconds to see what happens. If the record arrives during that second 60 seconds, the test fails with `failed 1st time`, indicating it would have passed on a second run; otherwise, it fails with the normal error message (i.e. `Did not receive all 1 record...`). A minimal sketch of this two-phase wait appears after the log walkthrough below.

OK, let's check the logs. You can scroll down to the **standard error** logs (where I put my debug logs). At first, we have 3 stream clients (threads) created:

```
Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9CREATED,REBALANCING
Stream-client KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93CREATED,REBALANCING
stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , CREATED,STARTING
stream-thread [KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93-StreamThread-1] , CREATED,STARTING
Stream-client KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5CREATED,REBALANCING
stream-thread [KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1] , CREATED,STARTING
```

And then the tasks get rebalanced. You can see these logs (reason: Adding new member KTable-FKJ-Multi...):

```
0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 1 (__consumer_offsets-3) (reason: Adding new member KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1-consumer-7fff842d-2157-410a-a939-cca43c0d8e29 with group instance id None)
responseStatusMap(__consumer_offsets-3 -> {error: NONE,offset: 0,logAppendTime: -1, logStartOffset: 0, recordErrors: [], errorMessage: null})
Successfully synced group in generation:Generation{generationId=1, memberId='consumer-ktable-ktable-consumer-7-daa57628-31cb-498c-9d72-168874157cc1', protocol='range'}
```

Later, 1 consumer left the group on heartbeat expiration (reason: removing member KTable-FKJ-Multi... on heartbeat expiration):

```
0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 2 (__consumer_offsets-3) (reason: removing member KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1-consumer-7fff842d-2157-410a-a939-cca43c0d8e29 on heartbeat expiration)
Group coordinator:, cause:session timed out without receiving a heartbeat response,isDis:false
```

And then you can see this line, which means the first 60 seconds were reached:

```
!!! failed 1st time
```

In the meantime, you can see the stream states change to `PARTITIONS_ASSIGNED`, `PARTITIONS_REVOKED`, and `RUNNING`, but we never reach a state where all 3 stream clients are `RUNNING` at the same time.
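Before continuing with the logs, here is the minimal sketch of the two-phase wait I mentioned above, assuming a plain consumer polling loop. It is illustrative only: the class and helper names (`TwoPhaseWait`, `waitForRecords`, `assertRecordsReceived`) and the polling details are my own, not the actual test code.

```java
import static org.junit.Assert.fail;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;

// Illustrative sketch only; names and polling details are assumptions.
final class TwoPhaseWait {
    private static final long TIMEOUT_MS = 60_000L;

    // Poll until `expected` records arrive or `timeoutMs` elapses,
    // returning whatever was received in that window.
    static <K, V> List<ConsumerRecord<K, V>> waitForRecords(
            final Consumer<K, V> consumer, final int expected, final long timeoutMs) {
        final List<ConsumerRecord<K, V>> received = new ArrayList<>();
        final long deadline = System.currentTimeMillis() + timeoutMs;
        while (received.size() < expected && System.currentTimeMillis() < deadline) {
            consumer.poll(Duration.ofMillis(100)).forEach(received::add);
        }
        return received;
    }

    static <K, V> void assertRecordsReceived(final Consumer<K, V> consumer, final int expected) {
        final int firstBatch = waitForRecords(consumer, expected, TIMEOUT_MS).size();
        if (firstBatch < expected) {
            // The normal 60-second wait timed out; don't stop immediately,
            // keep waiting another 60 seconds for the remaining records.
            final int secondBatch =
                waitForRecords(consumer, expected - firstBatch, TIMEOUT_MS).size();
            if (firstBatch + secondBatch >= expected) {
                fail("!!! failed 1st time"); // would have passed on a 2nd run
            } else {
                fail("Did not receive all " + expected + " record...");
            }
        }
    }
}
```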
Later, another consumer left the group on heartbeat expiration again (reason: removing member KTable-FKJ-Multi... on heartbeat expiration):

```
0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 5 (__consumer_offsets-3) (reason: removing member KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1-consumer-080464ae-710f-478e-9195-73b55285e21d on heartbeat expiration)
```

Finally, we can see all 3 stream clients reach the `RUNNING` state:

```
stream-thread [KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
Stream-client KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93REBALANCING,RUNNING
stream-thread [KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
Stream-client KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5REBALANCING,RUNNING
Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9REBALANCING,RUNNING
```

And then the streams start to process the data (committing task offsets):

```
committing task off:{0_0={table1-0=OffsetAndMetadata{offset=1, leaderEpoch=null, metadata='AQAAAXil/kaK'}}...
```

And the test completes and closes all the streams:

```
closing Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9RUNNING,PENDING_SHUTDOWN
stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , RUNNING,PENDING_SHUTDOWN
stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , PENDING_SHUTDOWN,DEAD
Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9PENDING_SHUTDOWN,NOT_RUNNING
```

That's basically what I observed, and that's why I want to increase the heartbeat interval to fix this flaky test (a rough sketch of what I mean is at the end of this comment). I also tried forcing all 3 streams to reach the `RUNNING` state before creating another consumer to consume the data, but that doesn't help: the test still fails sometimes because all 3 stream clients don't reach `RUNNING` within 60 seconds.

There are also other identically failing tests in this branch's build. You can check the other test logs here: https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-9724/237/ If you want to check any other logs, let me know. Thank you.
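Edit: to make concrete what I mean by increasing the heartbeat interval, here is a hypothetical sketch. The config keys are the real consumer configs (`heartbeat.interval.ms`, `session.timeout.ms`, passed through the Streams consumer prefix), but the values are placeholders, and whether we touch one or both is still open:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

// Hypothetical sketch only: the values below are placeholders, not the
// actual change in this PR.
final class RelaxedHeartbeatConfig {
    static Properties withRelaxedHeartbeat(final Properties base) {
        final Properties props = new Properties();
        props.putAll(base);
        // Heartbeat settings belong to the embedded consumer, so they are
        // passed through the Streams consumer prefix.
        props.put(StreamsConfig.consumerPrefix(
                ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), 10_000);
        // heartbeat.interval.ms is typically kept to no more than one third
        // of session.timeout.ms, so the session timeout is raised as well.
        props.put(StreamsConfig.consumerPrefix(
                ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 60_000);
        return props;
    }
}
```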