[ 
https://issues.apache.org/jira/browse/KAFKA-15891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899244#comment-17899244
 ] 

Will Perlichek edited comment on KAFKA-15891 at 11/18/24 7:35 PM:
------------------------------------------------------------------

[~ChrisEgerton] [~yash.mayya] [~gregharris73]

Hi all,

I have been working on this ticket for about a week now, specifically diving 
into OffsetsApiIntegrationTest.java and trying to resolve flakiness.

I want to touch base now and make sure I am on the right track, and I saw that 
all three of you had worked on or contributed to discussions on this file.

I'll attempt a very targeted question to keep it brief:

I strongly think that _some_ flakiness for 
testResetSinkConnectorOffsetsOverriddenConsumerGroupId could be reduced by 
restarting the connect cluster if we're sure the problem was due to a zombie 
sink task.

We can be reasonably confident that zombie sink tasks caused this method not to 
finish, because the log from this example CI failure shows the message below.

Develocity reference: 
[https://ge.apache.org/s/r4f5opmfmls54/tests/task/:connect:runtime:quarantinedTest/details/org.apache.kafka.connect.integration.OffsetsApiIntegrationTest/testResetSinkConnectorOffsetsOverriddenConsumerGroupId()?top-execution=1]

Error message from the log:
ERROR Failed to reset consumer group offsets for connector 
testResetSinkConnectorOffsetsOverriddenConsumerGroupId either because its tasks 
haven't stopped completely yet or the connector was resumed before the request 
to reset its offsets could be successfully completed. If the connector is in a 
stopped state, this operation can be safely retried. If it doesn't eventually 
succeed, the Connect cluster may need to be restarted to get rid of the zombie 
sink tasks.

We retried for 30 seconds so to me the evidence suggests it's the zombie sink 
task problem...

The timeout actually occurs in the helper method 
modifySinkConnectorOffsetsWithRetry, in its call to waitForCondition. My 
assumption is that waitForCondition never succeeds because the zombie sink task 
keeps the consumer group non-empty, so every attempt to delete the offsets 
through the offsets API fails with a GroupNotEmptyException.
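
To illustrate that assumption, here is a minimal sketch in plain Java (it is 
not the actual test code). pollUntil is a hypothetical stand-in for 
waitForCondition and GroupNotEmptyStandIn is a hypothetical stand-in for 
GroupNotEmptyException. I am assuming that a failed attempt is simply treated 
as "not yet" and retried until the timeout.

```java
import java.util.function.BooleanSupplier;

final class PollingTimeoutSketch {

    /** Hypothetical stand-in for GroupNotEmptyException; not the Kafka class. */
    static final class GroupNotEmptyStandIn extends RuntimeException {
        GroupNotEmptyStandIn(String message) {
            super(message);
        }
    }

    /**
     * Hypothetical stand-in for waitForCondition: polls until the attempt
     * returns true or the timeout elapses. A thrown RuntimeException is
     * treated as "not yet" (an assumption of this sketch).
     */
    static boolean pollUntil(BooleanSupplier attempt, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            try {
                if (attempt.getAsBoolean()) {
                    return true;
                }
            } catch (RuntimeException e) {
                // Failed attempt: fall through and retry until the deadline.
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // A zombie task never leaves the group, so every reset attempt fails.
        BooleanSupplier resetWithZombie = () -> {
            throw new GroupNotEmptyStandIn("consumer group still has an active member");
        };
        boolean succeeded = pollUntil(resetWithZombie, 200, 20);
        System.out.println("reset succeeded: " + succeeded); // prints "reset succeeded: false"
    }
}
```

If the failure is permanent rather than transient, a longer timeout does not 
help: the loop only ends when the deadline passes.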

My question:

In the test code here

[https://github.com/apache/kafka/blob/50c15b94c94fbe8f964703c057963b38100b0bd6/connect/runtime/src/test/java/org/apache/kafka/connect/integration/OffsetsApiIntegrationTest.java#L775]

I could restart the Connect cluster, as the error message we get advises... 

I think this would reduce flakiness in this test, and a similar approach could 
be adopted to reduce the flakiness of other tests in the class such as 
testAlterSinkConnectorOffsetsDifferentKafkaClusterTargeted that also appears to 
be flaky due to zombie sink tasks.
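
A rough sketch of the shape I have in mind, in plain Java with hypothetical 
names throughout: RestartableCluster, pollUntil and resetWithRestartFallback 
are not existing Kafka Connect test APIs, and the zombie task is simulated with 
a flag. The reset is retried as it is today; only if that wait times out is 
the cluster restarted, and then the reset is retried once more.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

final class RestartFallbackSketch {

    /** Hypothetical handle on the embedded cluster; not a real test API. */
    interface RestartableCluster {
        void restart();
    }

    /** Polls until the attempt returns true or the timeout elapses. */
    static boolean pollUntil(BooleanSupplier attempt, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!attempt.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    /** Retries the reset; restarts the cluster once if the first wait times out. */
    static boolean resetWithRestartFallback(BooleanSupplier resetAttempt,
                                            RestartableCluster cluster,
                                            long timeoutMs,
                                            long pollMs) {
        if (pollUntil(resetAttempt, timeoutMs, pollMs)) {
            return true; // healthy run: no restart needed
        }
        cluster.restart(); // assumed to get rid of any zombie sink tasks
        return pollUntil(resetAttempt, timeoutMs, pollMs);
    }

    public static void main(String[] args) {
        AtomicBoolean zombieAlive = new AtomicBoolean(true);
        AtomicInteger restarts = new AtomicInteger();

        // The simulated reset only succeeds once the zombie task is gone.
        BooleanSupplier resetAttempt = () -> !zombieAlive.get();
        RestartableCluster cluster = () -> {
            restarts.incrementAndGet();
            zombieAlive.set(false);
        };

        boolean ok = resetWithRestartFallback(resetAttempt, cluster, 200, 20);
        // prints "reset succeeded: true, restarts: 1"
        System.out.println("reset succeeded: " + ok + ", restarts: " + restarts.get());
    }
}
```

Restarting only after the first wait times out means runs without a zombie 
task pay no extra cost, but it hides the zombie task rather than explaining why 
it exists.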

I'd like to attempt a solution along these lines if you think this approach is 
correct. Or is this too much of a band-aid that doesn't address the core 
problem? If the latter, can you suggest a more robust approach to handling 
zombie sink tasks in the context of this class? My goal here is to reduce test 
flakiness overall in this class. 

Thanks,
Will


> Flaky test: testResetSinkConnectorOffsetsOverriddenConsumerGroupId – 
> org.apache.kafka.connect.integration.OffsetsApiIntegrationTest
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-15891
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15891
>             Project: Kafka
>          Issue Type: Bug
>          Components: connect
>            Reporter: Apoorv Mittal
>            Assignee: Will Perlichek
>            Priority: Major
>              Labels: flaky-test
>
> h4. Error
> org.opentest4j.AssertionFailedError: Condition not met within timeout 30000. 
> Sink connector consumer group offsets should catch up to the topic end 
> offsets ==> expected: <true> but was: <false>
> h4. Stacktrace
> org.opentest4j.AssertionFailedError: Condition not met within timeout 30000. 
> Sink connector consumer group offsets should catch up to the topic end 
> offsets ==> expected: <true> but was: <false>
>  at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>  at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>  at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
>  at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
>  at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:210)
>  at 
> app//org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:331)
>  at 
> app//org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:379)
>  at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:328)
>  at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:312)
>  at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:302)
>  at 
> app//org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.verifyExpectedSinkConnectorOffsets(OffsetsApiIntegrationTest.java:917)
>  at 
> app//org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.resetAndVerifySinkConnectorOffsets(OffsetsApiIntegrationTest.java:725)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
