showuon edited a comment on pull request #10409: URL: https://github.com/apache/kafka/pull/10409#issuecomment-816544910
@ableegoldman , I ran the test multiple times on Jenkins and printed some debug logs. Please check here: https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-9724/237/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegration10Test/Build___JDK_15_and_Scala_2_13___shouldInnerJoinMultiPartitionQueryable/

The error message `failed 1st time` appears because I don't stop the test immediately when the normal wait for consuming 1 record times out (60 seconds). Instead, I keep waiting for another 60 seconds to see what happens. If the record arrives during that second 60 seconds, the test fails with `failed 1st time`, indicating it would have passed on a second run; otherwise, it fails with the normal error message (i.e. `Did not receive all 1 record...`). A minimal sketch of this two-phase wait appears after the log walkthrough below.

OK, let's check the logs. You can scroll down to the **standard error** logs (where I put my debug logs). At first, we have 3 stream clients (threads) created:

```
Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9CREATED,REBALANCING
Stream-client KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93CREATED,REBALANCING
stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , CREATED,STARTING
stream-thread [KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93-StreamThread-1] , CREATED,STARTING
Stream-client KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5CREATED,REBALANCING
stream-thread [KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1] , CREATED,STARTING
```

And then the tasks get rebalanced. You can see these logs (reason: Adding new member KTable-FKJ-Multi...):

```
0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 1 (__consumer_offsets-3) (reason: Adding new member KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1-consumer-7fff842d-2157-410a-a939-cca43c0d8e29 with group instance id None)
responseStatusMap(__consumer_offsets-3 -> {error: NONE,offset: 0,logAppendTime: -1, logStartOffset: 0, recordErrors: [], errorMessage: null})
Successfully synced group in generation:Generation{generationId=1, memberId='consumer-ktable-ktable-consumer-7-daa57628-31cb-498c-9d72-168874157cc1', protocol='range'}
```

Later, 1 consumer left the group on heartbeat expiration (reason: removing member KTable-FKJ-Multi... on heartbeat expiration):

```
0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 2 (__consumer_offsets-3) (reason: removing member KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1-consumer-7fff842d-2157-410a-a939-cca43c0d8e29 on heartbeat expiration)
Group coordinator:, cause:session timed out without receiving a heartbeat response,isDis:false
```

And then you can see this line, which means the first 60 seconds were reached:

```
!!! failed 1st time
```

In the meantime, you can see the stream states change to `PARTITIONS_ASSIGNED`, `PARTITIONS_REVOKED`, and `RUNNING`, but we never reach a state where all 3 stream clients are `RUNNING` at the same time.
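Before continuing with the logs, here is the minimal sketch of the two-phase wait I mentioned above, assuming a plain consumer polling loop. It is illustrative only: the class and helper names (`TwoPhaseWait`, `waitForRecords`, `assertRecordsReceived`) and the polling details are my own, not the actual test code.

```java
import static org.junit.Assert.fail;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;

// Illustrative sketch only; names and polling details are assumptions.
final class TwoPhaseWait {
    private static final long TIMEOUT_MS = 60_000L;

    // Poll until `expected` records arrive or `timeoutMs` elapses,
    // returning whatever was received in that window.
    static <K, V> List<ConsumerRecord<K, V>> waitForRecords(
            final Consumer<K, V> consumer, final int expected, final long timeoutMs) {
        final List<ConsumerRecord<K, V>> received = new ArrayList<>();
        final long deadline = System.currentTimeMillis() + timeoutMs;
        while (received.size() < expected && System.currentTimeMillis() < deadline) {
            consumer.poll(Duration.ofMillis(100)).forEach(received::add);
        }
        return received;
    }

    static <K, V> void assertRecordsReceived(final Consumer<K, V> consumer, final int expected) {
        final int firstBatch = waitForRecords(consumer, expected, TIMEOUT_MS).size();
        if (firstBatch < expected) {
            // The normal 60-second wait timed out; don't stop immediately,
            // keep waiting another 60 seconds for the remaining records.
            final int secondBatch =
                waitForRecords(consumer, expected - firstBatch, TIMEOUT_MS).size();
            if (firstBatch + secondBatch >= expected) {
                fail("!!! failed 1st time"); // would have passed on a 2nd run
            } else {
                fail("Did not receive all " + expected + " record...");
            }
        }
    }
}
```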
Later, another consumer left the group on heartbeat expiration again (reason: removing member KTable-FKJ-Multi... on heartbeat expiration):

```
0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 5 (__consumer_offsets-3) (reason: removing member KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1-consumer-080464ae-710f-478e-9195-73b55285e21d on heartbeat expiration)
```

Finally, we can see all 3 stream clients reach the `RUNNING` state:

```
stream-thread [KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
Stream-client KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93REBALANCING,RUNNING
stream-thread [KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
Stream-client KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5REBALANCING,RUNNING
Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9REBALANCING,RUNNING
```

And then the streams start to process the data (committing task offsets):

```
committing task off:{0_0={table1-0=OffsetAndMetadata{offset=1, leaderEpoch=null, metadata='AQAAAXil/kaK'}}...
```

And the test completes and closes all the streams:

```
closing Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9RUNNING,PENDING_SHUTDOWN
stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , RUNNING,PENDING_SHUTDOWN
stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , PENDING_SHUTDOWN,DEAD
Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9PENDING_SHUTDOWN,NOT_RUNNING
```

That's basically what I observed, and that's why I want to increase the heartbeat interval to fix this flaky test (a rough sketch of what I mean is at the end of this comment). I also tried forcing all 3 streams to reach the `RUNNING` state before creating another consumer to consume the data, but that doesn't help: the test still fails sometimes because all 3 stream clients don't reach `RUNNING` within 60 seconds.

There are also other identically failing tests in this branch's build. You can check the other test logs here: https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-9724/237/ If you want to check any other logs, let me know. Thank you.
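Edit: to make concrete what I mean by increasing the heartbeat interval, here is a hypothetical sketch. The config keys are the real consumer configs (`heartbeat.interval.ms`, `session.timeout.ms`, passed through the Streams consumer prefix), but the values are placeholders, and whether we touch one or both is still open:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

// Hypothetical sketch only: the values below are placeholders, not the
// actual change in this PR.
final class RelaxedHeartbeatConfig {
    static Properties withRelaxedHeartbeat(final Properties base) {
        final Properties props = new Properties();
        props.putAll(base);
        // Heartbeat settings belong to the embedded consumer, so they are
        // passed through the Streams consumer prefix.
        props.put(StreamsConfig.consumerPrefix(
                ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), 10_000);
        // heartbeat.interval.ms is typically kept to no more than one third
        // of session.timeout.ms, so the session timeout is raised as well.
        props.put(StreamsConfig.consumerPrefix(
                ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 60_000);
        return props;
    }
}
```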