[ https://issues.apache.org/jira/browse/KAFKA-15891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899244#comment-17899244 ]
Will Perlichek edited comment on KAFKA-15891 at 11/18/24 7:38 PM:
------------------------------------------------------------------

[~ChrisEgerton] [~yash.mayya] [~gregharris73]

Hi all,

I have been working on this ticket for about a week now, specifically digging into OffsetsApiIntegrationTest.java and trying to resolve its flakiness. I want to touch base and make sure I am on the right track, and I saw that all three of you have worked on or contributed to discussions about this file. I'll keep this to one targeted question.

I strongly think that _some_ of the flakiness in testResetSinkConnectorOffsetsOverriddenConsumerGroupId could be reduced by restarting the Connect cluster when we're sure the problem was caused by a zombie sink task.

We can be reasonably confident that zombie sink tasks kept this method from finishing, based on the message shown in this example CI failure.

Develocity reference: [https://ge.apache.org/s/r4f5opmfmls54/tests/task/:connect:runtime:quarantinedTest/details/org.apache.kafka.connect.integration.OffsetsApiIntegrationTest/testResetSinkConnectorOffsetsOverriddenConsumerGroupId()?top-execution=1]

Stack trace message:
{code:java}
ERROR Failed to reset consumer group offsets for connector testResetSinkConnectorOffsetsOverriddenConsumerGroupId either because its tasks haven't stopped completely yet or the connector was resumed before the request to reset its offsets could be successfully completed. If the connector is in a stopped state, this operation can be safely retried. If it doesn't eventually succeed, the Connect cluster may need to be restarted to get rid of the zombie sink tasks.{code}

We retried for 30 seconds, so to me the evidence points to the zombie sink task problem. It is actually the helper method modifySinkConnectorOffsetsWithRetry that times out when using waitForCondition.
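A minimal sketch of the restart-on-zombie idea, under stated assumptions: the class and method names below (ZombieRetrySketch, runWithRestartOnZombie, mentionsZombie) are hypothetical and not part of the Kafka test code, the two Runnable arguments stand in for the offsets modification and the Connect cluster restart, and it assumes the failure that reaches the test actually carries the "zombie sink task" text in its message or cause chain, which may not hold for a plain waitForCondition timeout.

```java
final class ZombieRetrySketch {
    // Substring of the error text quoted above; using it as a marker is an assumption.
    static final String ZOMBIE_MARKER = "zombie sink task";

    /**
     * Runs the operation once. If it fails with a zombie-related message,
     * runs the restart action once and retries the operation exactly once.
     * Returns true if a restart was needed; unrelated failures are rethrown.
     */
    static boolean runWithRestartOnZombie(Runnable operation, Runnable restart) {
        try {
            operation.run();
            return false;
        } catch (RuntimeException | AssertionError e) {
            if (!mentionsZombie(e)) {
                throw e; // not the failure mode this sketch targets
            }
            restart.run();
            operation.run(); // single retry; a second failure propagates
            return true;
        }
    }

    /** Checks the throwable and its cause chain for the marker text. */
    static boolean mentionsZombie(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            String message = c.getMessage();
            if (message != null && message.contains(ZOMBIE_MARKER)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        boolean restarted = runWithRestartOnZombie(
            // Stand-in for the offsets modification: fails on the first attempt only.
            () -> {
                if (attempts[0]++ == 0) {
                    throw new AssertionError("restart to get rid of the zombie sink tasks.");
                }
            },
            // Stand-in for restarting the Connect cluster.
            () -> System.out.println("restarting Connect cluster (stand-in)"));
        System.out.println("restarted=" + restarted + ", attempts=" + attempts[0]);
    }
}
```

The retry is capped at one so that a persistent failure still surfaces as a test failure instead of being masked by repeated restarts.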
My assumption is that waitForCondition never succeeds because the zombie sink task causes a GroupNotEmptyException every time we try to use the offsets API: we can't delete the offsets while the zombie sink task is still around.

My question: in the test code here [https://github.com/apache/kafka/blob/50c15b94c94fbe8f964703c057963b38100b0bd6/connect/runtime/src/test/java/org/apache/kafka/connect/integration/OffsetsApiIntegrationTest.java#L775] I can restart the Connect cluster, as the exception message advises. If modifySinkConnectorOffsetsWithRetry throws an exception whose message contains the string "zombie sink task", I can catch it, restart the Connect cluster, and call modifySinkConnectorOffsetsWithRetry one more time.

I think this would reduce flakiness in this test, and a similar approach could be adopted for other tests in the class, such as testAlterSinkConnectorOffsetsDifferentKafkaClusterTargeted, which also appears to be flaky due to zombie sink tasks. I'd like to attempt a solution along these lines if you think the approach is correct.

Or is this too much of a band-aid that doesn't address the core problem? If so, can you suggest a more robust way to handle zombie sink tasks in the context of this class? My goal is to reduce test flakiness overall in this class.

Thanks,
Will

> Flaky test: testResetSinkConnectorOffsetsOverriddenConsumerGroupId – org.apache.kafka.connect.integration.OffsetsApiIntegrationTest
> -----------------------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-15891
> URL: https://issues.apache.org/jira/browse/KAFKA-15891
> Project: Kafka
> Issue Type: Bug
> Components: connect
> Reporter: Apoorv Mittal
> Assignee: Will Perlichek
> Priority: Major
> Labels: flaky-test
>
> h4. Error
> org.opentest4j.AssertionFailedError: Condition not met within timeout 30000. Sink connector consumer group offsets should catch up to the topic end offsets ==> expected: <true> but was: <false>
> h4. Stacktrace
> org.opentest4j.AssertionFailedError: Condition not met within timeout 30000.
> Sink connector consumer group offsets should catch up to the topic end offsets ==> expected: <true> but was: <false>
> at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
> at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
> at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
> at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
> at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:210)
> at app//org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:331)
> at app//org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:379)
> at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:328)
> at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:312)
> at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:302)
> at app//org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.verifyExpectedSinkConnectorOffsets(OffsetsApiIntegrationTest.java:917)
> at app//org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.resetAndVerifySinkConnectorOffsets(OffsetsApiIntegrationTest.java:725)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)