[
https://issues.apache.org/jira/browse/FLINK-22488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dawid Wysakowicz updated FLINK-22488:
-------------------------------------
Comment: was deleted
(was: One observation:
* some of the sub-tasks finish right away after starting, which makes it
impossible to take any checkpoints/savepoints
{code}
[Source: KafkaSource -> Map -> Map (3/8)#1] INFO
org.apache.flink.connector.base.source.reader.SourceReaderBase [] - Closing
Source Reader.
22:02:57,135 [Source: KafkaSource -> Map -> Map (5/8)#1] INFO
org.apache.flink.connector.base.source.reader.SourceReaderBase [] - Reader
received NoMoreSplits event.
22:02:57,136 [Source: KafkaSource -> Map -> Map (4/8)#1] INFO
org.apache.flink.runtime.taskmanager.Task [] - Source:
KafkaSource -> Map -> Map (4/8)#1 (334496a6fd2c371dd842e7ecb61811d9) switched
from RUNNING to FINISHED.
22:05:27,161 [FailingIdentityMapper Status Printer] INFO
org.apache.flink.streaming.connectors.kafka.testutils.FailingIdentityMapper []
- ============================> Failing mapper 6: count=1000, totalCount=1000
22:05:27,162 [FailingIdentityMapper Status Printer] INFO
org.apache.flink.streaming.connectors.kafka.testutils.FailingIdentityMapper []
- ============================> Failing mapper 1: count=1000, totalCount=1000
22:05:27,163 [FailingIdentityMapper Status Printer] INFO
org.apache.flink.streaming.connectors.kafka.testutils.FailingIdentityMapper []
- ============================> Failing mapper 7: count=1000, totalCount=1000
22:05:27,163 [FailingIdentityMapper Status Printer] INFO
org.apache.flink.streaming.connectors.kafka.testutils.FailingIdentityMapper []
- ============================> Failing mapper 0: count=1000, totalCount=1000
22:05:27,163 [FailingIdentityMapper Status Printer] INFO
org.apache.flink.streaming.connectors.kafka.testutils.FailingIdentityMapper []
- ============================> Failing mapper 5: count=1000, totalCount=1000
22:05:27,628 [ Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Failed to
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint
Failure reason: Not all required tasks are currently running.
22:05:28,128 [ Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Failed to
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint
Failure reason: Not all required tasks are currently running.
22:05:28,628 [ Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Failed to
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint
Failure reason: Not all required tasks are currently running.
22:05:29,128 [ Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Failed to
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint
Failure reason: Not all required tasks are currently running.
22:05:29,627 [ Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Failed to
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint
Failure reason: Not all required tasks are currently running.
22:05:30,128 [ Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Failed to
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint
Failure reason: Not all required tasks are currently running.
22:05:30,628 [ Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Failed to
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint
Failure reason: Not all required tasks are currently running.
22:05:31,128 [ Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Failed to
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint
Failure reason: Not all required tasks are currently running.
22:05:31,628 [ Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Failed to
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint
Failure reason: Not all required tasks are currently running.
22:05:32,128 [ Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Failed to
trigger checkpoint for job 461c989450b54b7847421b914ddb4f00 since some tasks of
job 461c989450b54b7847421b914ddb4f00 has been finished, abort the checkpoint
Failure reason: Not all required tasks are currently running.
{code})
> KafkaSourceLegacyITCase.testOneToOneSources failed due to "OperatorEvent from
> an OperatorCoordinator to a task was lost"
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-22488
> URL: https://issues.apache.org/jira/browse/FLINK-22488
> Project: Flink
> Issue Type: Improvement
> Reporter: Dong Lin
> Priority: Minor
> Labels: pull-request-available
>
> According to [1], the test KafkaSourceLegacyITCase.testOneToOneSources failed
> because it runs a streaming job (which uses KafkaSource) with
> restartAttempts=1. In addition to the failover explicitly triggered by the
> FailingIdentityMapper, the job additionally failed due to
> "org.apache.flink.util.FlinkException: An OperatorEvent from an
> OperatorCoordinator to a task was lost. Triggering task failover to ensure
> consistency", which is unexpected by the test.
> Note that SubtaskGatewayImpl was updated by [2] on 4/14 which triggers task
> failover if any OperatorEvent was lost. This could explain why those Kafka
> tests start to fail due to the exception described above.
> In order to make this test stable, let's try to understand why there is such
> a high chance of loosing OperatorEvent in the Azure test pipeline. And if we
> could not avoid loosing OperatorEvent in the test pipeline, we probably need
> to update the test to allow the pipeline being restarted arbitrary times (and
> still be able to stop the test on the happy path).
> [1]
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17212&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5&l=6960
> [2] https://github.com/apache/flink/pull/15605
--
This message was sent by Atlassian Jira
(v8.3.4#803005)