Dong Lin created FLINK-22488:
--------------------------------

             Summary: KafkaSourceLegacyITCase.testOneToOneSources failed due to 
"OperatorEvent from an OperatorCoordinator to a task was lost"
                 Key: FLINK-22488
                 URL: https://issues.apache.org/jira/browse/FLINK-22488
             Project: Flink
          Issue Type: Improvement
            Reporter: Dong Lin


According to [1], the test KafkaSourceLegacyITCase.testOneToOneSources failed 
because it runs a streaming job (which uses KafkaSource) with 
restartAttempts=1. In addition to the failover explicitly triggered by the 
FailingIdentityMapper, the job additionally failed due to 
"org.apache.flink.util.FlinkException: An OperatorEvent from an 
OperatorCoordinator to a task was lost. Triggering task failover to ensure 
consistency", which is unexpected by the test.

Note that SubtaskGatewayImpl was updated by [2] on 4/14 which triggers task 
failover if any OperatorEvent was lost. This could explain why those Kafka 
tests start to fail due to the exception described above.

In order to make this test stable, let's try to understand why there is such a 
high chance of loosing OperatorEvent in the Azure test pipeline. And if we 
could not avoid loosing OperatorEvent in the test pipeline, we probably need to 
update the test to allow the pipeline being restarted arbitrary times (and 
still be able to stop the test on the happy path).


[1] 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17212&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5&l=6960
[2] https://github.com/apache/flink/pull/15605




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to