[
https://issues.apache.org/jira/browse/FLINK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419263#comment-17419263
]
Dong Lin commented on FLINK-20928:
----------------------------------
[~dwysakowicz] I found a possible explanation for the flaky test and created a
PR to fix this issue in https://github.com/apache/flink/pull/17342. Would you
have time to review this PR?
Here are the problems with the existing code that could explain why the test is
flaky:
- The test calls KafkaSourceReader.notifyCheckpointComplete(...) once and
expects the offset commit to be successful.
- However, KafkaSourceReader.notifyCheckpointComplete(...) does not guarantee
the offset commit to be successfully. This is because it calls
KafkaConsumer.commitAsync(...) just once and won't retry even if the commit
fails with an retriable exception.
- During in the test, if the coordinator is temporarily unavailable due to e.g.
coordinator movement or network disconnection, the test will fail due to
TimeoutException.
> KafkaSourceReaderTest.testOffsetCommitOnCheckpointComplete:189->pollUntil:270
> » Timeout
> ---------------------------------------------------------------------------------------
>
> Key: FLINK-20928
> URL: https://issues.apache.org/jira/browse/FLINK-20928
> Project: Flink
> Issue Type: Bug
> Components: Connectors / Kafka
> Affects Versions: 1.13.0, 1.14.0, 1.15.0
> Reporter: Robert Metzger
> Assignee: Qingsheng Ren
> Priority: Critical
> Labels: pull-request-available, test-stability
> Fix For: 1.14.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=11861&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5
> {code}
> [ERROR] Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time elapsed:
> 93.992 s <<< FAILURE! - in
> org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest
> [ERROR]
> testOffsetCommitOnCheckpointComplete(org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest)
> Time elapsed: 60.086 s <<< ERROR!
> java.util.concurrent.TimeoutException: The offset commit did not finish
> before timeout.
> at
> org.apache.flink.core.testutils.CommonTestUtils.waitUtil(CommonTestUtils.java:210)
> at
> org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest.pollUntil(KafkaSourceReaderTest.java:270)
> at
> org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest.testOffsetCommitOnCheckpointComplete(KafkaSourceReaderTest.java:189)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)