[ 
https://issues.apache.org/jira/browse/FLINK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419263#comment-17419263
 ] 

Dong Lin commented on FLINK-20928:
----------------------------------

[~dwysakowicz] I found a possible explanation for the flaky test and created a 
PR to fix this issue in https://github.com/apache/flink/pull/17342. Would you 
have time to review this PR?

Here are the problems with the existing code that could explain why the test is 
flaky:

- The test calls KafkaSourceReader.notifyCheckpointComplete(...) once and 
expects the offset commit to be successful.
- However, KafkaSourceReader.notifyCheckpointComplete(...) does not guarantee 
the offset commit to be successfully. This is because it calls 
KafkaConsumer.commitAsync(...) just once and won't retry even if the commit 
fails with an retriable exception.
- During in the test, if the coordinator is temporarily unavailable due to e.g. 
coordinator movement or network disconnection, the test will fail due to 
TimeoutException.


> KafkaSourceReaderTest.testOffsetCommitOnCheckpointComplete:189->pollUntil:270 
> » Timeout
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-20928
>                 URL: https://issues.apache.org/jira/browse/FLINK-20928
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Kafka
>    Affects Versions: 1.13.0, 1.14.0, 1.15.0
>            Reporter: Robert Metzger
>            Assignee: Qingsheng Ren
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>             Fix For: 1.14.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=11861&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5
> {code}
> [ERROR] Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 93.992 s <<< FAILURE! - in 
> org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest
> [ERROR] 
> testOffsetCommitOnCheckpointComplete(org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest)
>   Time elapsed: 60.086 s  <<< ERROR!
> java.util.concurrent.TimeoutException: The offset commit did not finish 
> before timeout.
>       at 
> org.apache.flink.core.testutils.CommonTestUtils.waitUtil(CommonTestUtils.java:210)
>       at 
> org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest.pollUntil(KafkaSourceReaderTest.java:270)
>       at 
> org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest.testOffsetCommitOnCheckpointComplete(KafkaSourceReaderTest.java:189)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to