Hi everyone,

In some of my jobs, I occasionally run into the problem that some of the task 
managers lose the heartbeat connection to the job manager. The job manager did 
not crash, though. Here is an excerpt from the dashboard:

Error: java.lang.Exception: TaskManager lost heartbeat connection to JobManager
    at org.apache.flink.runtime.taskmanager.TaskManager.registerAndRunHeartbeatLoop(TaskManager.java:847)
    at org.apache.flink.runtime.taskmanager.TaskManager.access$000(TaskManager.java:109)
    at org.apache.flink.runtime.taskmanager.TaskManager$1.run(TaskManager.java:365)

I am not sure whether this is a bug. I suspect that the load on the network or 
on the job manager is too high, so that the heartbeats do not arrive (on time), 
but that is only a guess. As a first step, I could increase the heartbeat 
interval.
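
To make my guess concrete, here is a minimal sketch of how I imagine a
timeout-based heartbeat check behaves. This is NOT Flink's actual code; the
class HeartbeatMonitor, its methods, and the 30-second timeout are all
hypothetical and only illustrate why a delayed heartbeat could look like a
lost connection even though the other side is still alive:

```java
/** Hypothetical illustration only; not taken from Flink's code base. */
public class HeartbeatMonitor {
    private final long timeoutMillis;   // assumed timeout, not a Flink default
    private long lastHeartbeatMillis;   // time the last heartbeat was seen

    public HeartbeatMonitor(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastHeartbeatMillis = nowMillis;
    }

    /** Record that a heartbeat arrived at the given time. */
    public void onHeartbeat(long nowMillis) {
        lastHeartbeatMillis = nowMillis;
    }

    /** The peer counts as lost once no heartbeat arrived within the timeout. */
    public boolean isLost(long nowMillis) {
        return nowMillis - lastHeartbeatMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(30_000L, 0L);
        monitor.onHeartbeat(5_000L);
        // 15 s since the last heartbeat: still within the timeout.
        System.out.println(monitor.isLost(20_000L));
        // 35 s since the last heartbeat: reported as lost, even if the
        // heartbeat was merely delayed by network or CPU load.
        System.out.println(monitor.isLost(40_000L));
    }
}
```

If the real mechanism works roughly like this, a heartbeat that is only
delayed would be indistinguishable from a peer that is actually gone.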

Has any of you encountered this problem, or do you have any ideas on how to 
avoid it?

Thanks,
Sebastian
