[jira] [Created] (FLINK-34006) Flink terminates the execution of an application when there is a network problem between TaskManagers

Sophia (Jira) Fri, 05 Jan 2024 15:30:38 -0800

Sophia created FLINK-34006:
------------------------------

             Summary: Flink terminates the execution of an application when 
there is a network problem between TaskManagers
                 Key: FLINK-34006
                 URL: https://issues.apache.org/jira/browse/FLINK-34006
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing, Runtime / Task
    Affects Versions: 1.17.1
            Reporter: Sophia



Flink terminates an application when two TaskManager are disconnected although 
there are enough resources in the cluster to run the application and we use 
checkpoint restart.

 

We deploy Flink v(1.17.1) on a cluster of six nodes with Ubuntu 18.04, the 
cluster consists of a JobManager and five TaskManagers. We use Flink's 
Standalone resource manager. We set the number of slots per TaskManager to one, 
and submit a WordCount application with a level of parallelism equal to three. 
We enable Flink checkpointing and restart failover strategy to attempt a 
restart in case of failure three times before termination and the time between 
attempts to 10 seconds.

The application starts running on the first 3 TaskManager.

If the communication is broken between two of the TaskManager that run the 
application, the job fails, and the JobManager tries to restart the job again. 
When the job fails the resources on the TaskManager are free. When the 
JobManager restarts the job, it selects the same three TaskManager it choose in 
the first attempt, and the job fails again. After three trials, Flink 
terminates the job with an exception: Connecting to remote task manager has 
failed.

 

These are the JobManager Configurations:
 * taskmanager.numberOfTaskSlots: 1
 * Enable checkpointing: --checkpointing
 * execution.checkpointing.interval: 3min
 * Enabling restart failover strategy
 * restart-strategy.type: fixed-delay
 * restart-strategy.fixed-delay.attempts: 3
 * restart-strategy.fixed-delay.delay: 10 s

 

command: ./bin/flink run -p 3 examples/streaming/WordCount.jar --checkpointing 
--input ~/flink/alice.txt



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-34006) Flink terminates the execution of an application when there is a network problem between TaskManagers

Reply via email to