Sophia created FLINK-34006:
------------------------------
Summary: Flink terminates the execution of an application when
there is a network problem between TaskManagers
Key: FLINK-34006
URL: https://issues.apache.org/jira/browse/FLINK-34006
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing, Runtime / Task
Affects Versions: 1.17.1
Reporter: Sophia
Flink terminates an application when two TaskManager are disconnected although
there are enough resources in the cluster to run the application and we use
checkpoint restart.
We deploy Flink v(1.17.1) on a cluster of six nodes with Ubuntu 18.04, the
cluster consists of a JobManager and five TaskManagers. We use Flink's
Standalone resource manager. We set the number of slots per TaskManager to one,
and submit a WordCount application with a level of parallelism equal to three.
We enable Flink checkpointing and restart failover strategy to attempt a
restart in case of failure three times before termination and the time between
attempts to 10 seconds.
The application starts running on the first 3 TaskManager.
If the communication is broken between two of the TaskManager that run the
application, the job fails, and the JobManager tries to restart the job again.
When the job fails the resources on the TaskManager are free. When the
JobManager restarts the job, it selects the same three TaskManager it choose in
the first attempt, and the job fails again. After three trials, Flink
terminates the job with an exception: Connecting to remote task manager has
failed.
These are the JobManager Configurations:
* taskmanager.numberOfTaskSlots: 1
* Enable checkpointing: --checkpointing
* execution.checkpointing.interval: 3min
* Enabling restart failover strategy
* restart-strategy.type: fixed-delay
* restart-strategy.fixed-delay.attempts: 3
* restart-strategy.fixed-delay.delay: 10 s
command: ./bin/flink run -p 3 examples/streaming/WordCount.jar --checkpointing
--input ~/flink/alice.txt
--
This message was sent by Atlassian Jira
(v8.20.10#820010)