Sophia created FLINK-34006:
------------------------------

             Summary: Flink terminates the execution of an application when there is a network problem between TaskManagers
                 Key: FLINK-34006
                 URL: https://issues.apache.org/jira/browse/FLINK-34006
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing, Runtime / Task
    Affects Versions: 1.17.1
            Reporter: Sophia

Flink terminates an application when two TaskManagers are disconnected from each other, although the cluster has enough free resources to run the application and restart from checkpoints is enabled.

We deploy Flink 1.17.1 on a cluster of six nodes running Ubuntu 18.04. The cluster consists of one JobManager and five TaskManagers, and we use Flink's Standalone resource manager. We set the number of slots per TaskManager to one and submit a WordCount application with a parallelism of three. We enable Flink checkpointing and configure the restart (failover) strategy to attempt a restart three times before terminating the job, with 10 seconds between attempts.

The application starts running on the first three TaskManagers. If communication is broken between two of the TaskManagers that run the application, the job fails and the JobManager tries to restart it. When the job fails, the slots on those TaskManagers are freed. When the JobManager restarts the job, it selects the same three TaskManagers it chose in the first attempt, and the job fails again. After three attempts, Flink terminates the job with the exception: Connecting to remote task manager has failed.

These are the JobManager configurations:
 * taskmanager.numberOfTaskSlots: 1
 * Enable checkpointing: --checkpointing
 * execution.checkpointing.interval: 3min
 * Enabling restart failover strategy
 * restart-strategy.type: fixed-delay
 * restart-strategy.fixed-delay.attempts: 3
 * restart-strategy.fixed-delay.delay: 10 s

command:
./bin/flink run -p 3 examples/streaming/WordCount.jar --checkpointing --input ~/flink/alice.txt

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
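
To make the reported sequence easier to reason about, here is a small toy model in Python. It is only a sketch of the behaviour described in this report and does not use or reproduce Flink's scheduler code. All names in it (RestartPolicy, job_can_run, pick_same, pick_avoiding, run_job, the tm1..tm5 labels) are hypothetical, and pick_avoiding is an imagined alternative for comparison, not an existing Flink feature. The restart delay is recorded but not slept.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class RestartPolicy:
    attempts: int = 3   # mirrors restart-strategy.fixed-delay.attempts
    delay_s: int = 10   # mirrors restart-strategy.fixed-delay.delay (not slept here)


def job_can_run(placement, broken_links):
    """A placement works only if no two chosen TaskManagers share a broken link."""
    return not any(frozenset(pair) in broken_links
                   for pair in combinations(placement, 2))


def pick_same(all_tms, parallelism, broken_links):
    """Reported behaviour: every restart reuses the first `parallelism` TaskManagers."""
    return list(all_tms[:parallelism])


def pick_avoiding(all_tms, parallelism, broken_links):
    """Hypothetical alternative: first placement that avoids every broken link."""
    for candidate in combinations(all_tms, parallelism):
        if job_can_run(candidate, broken_links):
            return list(candidate)
    return list(all_tms[:parallelism])  # nothing better available


def run_job(all_tms, parallelism, broken_links, pick, policy=RestartPolicy()):
    """Return (final_state, restarts_used) for one job under a fixed-delay policy."""
    placement = list(all_tms[:parallelism])  # initial deployment
    restarts = 0
    # The link breaks while the job runs, so the loop starts from a failed job.
    while not job_can_run(placement, broken_links):
        if restarts == policy.attempts:
            return "FAILED", restarts
        restarts += 1
        placement = pick(all_tms, parallelism, broken_links)
    return "RUNNING", restarts


tms = ["tm1", "tm2", "tm3", "tm4", "tm5"]
broken = {frozenset({"tm1", "tm2"})}  # network problem between two TaskManagers

print(run_job(tms, 3, broken, pick_same))      # -> ('FAILED', 3)
print(run_job(tms, 3, broken, pick_avoiding))  # -> ('RUNNING', 1)
```

In this model the job is lost after three restarts when the same three TaskManagers are reused, while a placement that skips one side of the broken link recovers on the first restart using the two idle TaskManagers.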