[jira] [Issue Comment Deleted] (FLINK-22488) KafkaSourceLegacyITCase.testOneToOneSources failed due to "OperatorEvent from an OperatorCoordinator to a task was lost"

Dawid Wysakowicz (Jira) Tue, 27 Apr 2021 00:39:05 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-22488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dawid Wysakowicz updated FLINK-22488:
-------------------------------------
    Comment: was deleted

(was: One observation:
* Some of the subtasks finish right after starting, which makes it 
impossible to take any checkpoints or savepoints:
{code}
[Source: KafkaSource -> Map -> Map (3/8)#1] INFO  
org.apache.flink.connector.base.source.reader.SourceReaderBase [] - Closing 
Source Reader.
22:02:57,135 [Source: KafkaSource -> Map -> Map (5/8)#1] INFO  
org.apache.flink.connector.base.source.reader.SourceReaderBase [] - Reader 
received NoMoreSplits event.
22:02:57,136 [Source: KafkaSource -> Map -> Map (4/8)#1] INFO  
org.apache.flink.runtime.taskmanager.Task                    [] - Source: 
KafkaSource -> Map -> Map (4/8)#1 (334496a6fd2c371dd842e7ecb61811d9) switched 
from RUNNING to FINISHED.


22:05:27,161 [FailingIdentityMapper Status Printer] INFO  
org.apache.flink.streaming.connectors.kafka.testutils.FailingIdentityMapper [] 
- ============================> Failing mapper  6: count=1000, totalCount=1000
22:05:27,162 [FailingIdentityMapper Status Printer] INFO  
org.apache.flink.streaming.connectors.kafka.testutils.FailingIdentityMapper [] 
- ============================> Failing mapper  1: count=1000, totalCount=1000
22:05:27,163 [FailingIdentityMapper Status Printer] INFO  
org.apache.flink.streaming.connectors.kafka.testutils.FailingIdentityMapper [] 
- ============================> Failing mapper  7: count=1000, totalCount=1000
22:05:27,163 [FailingIdentityMapper Status Printer] INFO  
org.apache.flink.streaming.connectors.kafka.testutils.FailingIdentityMapper [] 
- ============================> Failing mapper  0: count=1000, totalCount=1000
22:05:27,163 [FailingIdentityMapper Status Printer] INFO  
org.apache.flink.streaming.connectors.kafka.testutils.FailingIdentityMapper [] 
- ============================> Failing mapper  5: count=1000, totalCount=1000
22:05:27,628 [    Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Failed to 
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of 
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint 
Failure reason: Not all required tasks are currently running.
22:05:28,128 [    Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Failed to 
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of 
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint 
Failure reason: Not all required tasks are currently running.
22:05:28,628 [    Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Failed to 
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of 
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint 
Failure reason: Not all required tasks are currently running.
22:05:29,128 [    Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Failed to 
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of 
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint 
Failure reason: Not all required tasks are currently running.
22:05:29,627 [    Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Failed to 
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of 
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint 
Failure reason: Not all required tasks are currently running.
22:05:30,128 [    Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Failed to 
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of 
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint 
Failure reason: Not all required tasks are currently running.
22:05:30,628 [    Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Failed to 
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of 
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint 
Failure reason: Not all required tasks are currently running.
22:05:31,128 [    Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Failed to 
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of 
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint 
Failure reason: Not all required tasks are currently running.
22:05:31,628 [    Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Failed to 
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of 
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint 
Failure reason: Not all required tasks are currently running.
22:05:32,128 [    Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Failed to 
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of 
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint 
Failure reason: Not all required tasks are currently running.
{code})
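
The behaviour visible in the log above can be summarised with a toy model. This is only a sketch of the rule stated in the log message ("Not all required tasks are currently running"), not the actual CheckpointCoordinator code; the class name `CheckpointGateSketch`, the enum `TaskState` and the method `canTriggerCheckpoint` are hypothetical names introduced for illustration.

```java
import java.util.List;

// Toy model (hypothetical, not Flink code): a checkpoint is only triggered
// when every subtask is still RUNNING, as the log message above suggests.
public class CheckpointGateSketch {

    // Simplified subtask states; the real runtime has more of them.
    enum TaskState { RUNNING, FINISHED }

    // Returns true only if no subtask has left the RUNNING state.
    static boolean canTriggerCheckpoint(List<TaskState> subtasks) {
        return subtasks.stream().allMatch(s -> s == TaskState.RUNNING);
    }

    public static void main(String[] args) {
        List<TaskState> allRunning =
                List.of(TaskState.RUNNING, TaskState.RUNNING, TaskState.RUNNING);
        // One source subtask finished right after starting, as with subtask (4/8).
        List<TaskState> oneFinished =
                List.of(TaskState.RUNNING, TaskState.FINISHED, TaskState.RUNNING);

        System.out.println("all running:  " + canTriggerCheckpoint(allRunning));
        System.out.println("one finished: " + canTriggerCheckpoint(oneFinished));
    }
}
```

Under this model a single finished source subtask is enough to block every later checkpoint attempt, which would match the repeated "Failed to trigger checkpoint" lines.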

> KafkaSourceLegacyITCase.testOneToOneSources failed due to "OperatorEvent from 
> an OperatorCoordinator to a task was lost"
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-22488
>                 URL: https://issues.apache.org/jira/browse/FLINK-22488
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Dong Lin
>            Priority: Minor
>              Labels: pull-request-available
>
> According to [1], the test KafkaSourceLegacyITCase.testOneToOneSources failed 
> because it runs a streaming job (which uses KafkaSource) with 
> restartAttempts=1. In addition to the failover explicitly triggered by the 
> FailingIdentityMapper, the job failed a second time due to 
> "org.apache.flink.util.FlinkException: An OperatorEvent from an 
> OperatorCoordinator to a task was lost. Triggering task failover to ensure 
> consistency", which the test does not expect.
> Note that SubtaskGatewayImpl was updated by [2] on 4/14 to trigger a task 
> failover if any OperatorEvent is lost. This could explain why those Kafka 
> tests started to fail due to the exception described above.
> In order to make this test stable, let's try to understand why there is such 
> a high chance of losing an OperatorEvent in the Azure test pipeline. If we 
> cannot avoid losing OperatorEvents in the test pipeline, we probably need to 
> update the test to allow the pipeline to be restarted an arbitrary number of 
> times (and still be able to stop the test on the happy path).
> [1] 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17212&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5&l=6960
> [2] https://github.com/apache/flink/pull/15605
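
The restart-budget problem described in the issue can be illustrated with a small, self-contained model. This is only a sketch under simplifying assumptions: `RestartBudgetSketch` and `survives` are hypothetical names, and the logic is a plain counter, not Flink's actual restart strategy implementation.

```java
// Toy model (hypothetical, not Flink code) of a fixed restart budget:
// every failure consumes one restart attempt, and a failure that arrives
// when no attempts are left fails the job for good.
public class RestartBudgetSketch {

    // Returns true if the job is still alive after `failures` failures
    // when it was configured with `restartAttempts` restarts.
    static boolean survives(int restartAttempts, int failures) {
        int remaining = restartAttempts;
        for (int i = 0; i < failures; i++) {
            if (remaining == 0) {
                return false; // budget exhausted, job fails terminally
            }
            remaining--;
        }
        return true;
    }

    public static void main(String[] args) {
        // Only the intentional failure from FailingIdentityMapper.
        System.out.println("1 attempt, 1 failure:  " + survives(1, 1));
        // The intentional failure plus one lost OperatorEvent.
        System.out.println("1 attempt, 2 failures: " + survives(1, 2));
    }
}
```

With restartAttempts=1 the budget is fully spent on the intentional failure, so any additional failover, such as the one triggered by a lost OperatorEvent, ends the job. Allowing more restarts, as the description proposes, would widen that budget.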



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
