[
https://issues.apache.org/jira/browse/FLINK-22488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333032#comment-17333032
]
Dong Lin commented on FLINK-22488:
----------------------------------
[~dwysakowicz] Right, FLINK-21996 tries to address the issue where losing an
OperatorEvent could lead to inconsistent state. And it suggests that
"Intermediate RPC messages from JM to TM can get dropped" due to temporary
connection failure.
It is not surprising that connection could be lost temporarily in Azure. I
guess my suggestion is, if we could figure out the exact reason why this
connection failure is happening at this rate (1+ failure per day in Azure
pipeline), maybe we could increase number of RPC retries in our test setup to
significantly reduce the flakiness of this test.
> KafkaSourceLegacyITCase.testOneToOneSources failed due to "OperatorEvent from
> an OperatorCoordinator to a task was lost"
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-22488
> URL: https://issues.apache.org/jira/browse/FLINK-22488
> Project: Flink
> Issue Type: Improvement
> Reporter: Dong Lin
> Priority: Minor
> Labels: pull-request-available
>
> According to [1], the test KafkaSourceLegacyITCase.testOneToOneSources failed
> because it runs a streaming job (which uses KafkaSource) with
> restartAttempts=1. In addition to the failover explicitly triggered by the
> FailingIdentityMapper, the job additionally failed due to
> "org.apache.flink.util.FlinkException: An OperatorEvent from an
> OperatorCoordinator to a task was lost. Triggering task failover to ensure
> consistency", which is unexpected by the test.
> Note that SubtaskGatewayImpl was updated by [2] on 4/14 which triggers task
> failover if any OperatorEvent was lost. This could explain why those Kafka
> tests start to fail due to the exception described above.
> In order to make this test stable, let's try to understand why there is such
> a high chance of loosing OperatorEvent in the Azure test pipeline. And if we
> could not avoid loosing OperatorEvent in the test pipeline, we probably need
> to update the test to allow the pipeline being restarted arbitrary times (and
> still be able to stop the test on the happy path).
> [1]
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17212&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5&l=6960
> [2] https://github.com/apache/flink/pull/15605
--
This message was sent by Atlassian Jira
(v8.3.4#803005)