I am using the RemoteCollectorOutputFormat (if you recall, Fabian Tschirschnitz 
contributed this) to send the output data to the driver which happens to run on 
the same machine as the jobmanager. In some cases, this output becomes huge, 
which I assume is the problem.

However, since the heartbeat runs in its own thread, we could assign it a 
higher priority than the regular driver/jobmanager code, so that heartbeats 
are not suppressed. Or am I missing something?
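
A minimal sketch of the thread-priority idea in plain Java. This is not Flink's actual heartbeat code: the class HeartbeatSketch, the HeartbeatSender runnable, and the counter standing in for the real heartbeat message are all hypothetical names introduced for illustration. It only shows how a dedicated heartbeat thread could be given a higher priority than the threads doing the regular work.

```java
import java.util.concurrent.atomic.AtomicLong;

public class HeartbeatSketch {

    /** Hypothetical sender: "beats" at a fixed interval until interrupted. */
    static final class HeartbeatSender implements Runnable {
        private final long intervalMillis;
        private final AtomicLong beatsSent = new AtomicLong();

        HeartbeatSender(long intervalMillis) {
            this.intervalMillis = intervalMillis;
        }

        long beatsSent() {
            return beatsSent.get();
        }

        @Override
        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    // Placeholder for the real heartbeat message to the job manager.
                    beatsSent.incrementAndGet();
                    Thread.sleep(intervalMillis);
                }
            } catch (InterruptedException e) {
                // Restore the interrupt flag and let the thread end.
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Starts the sender on a daemon thread with the highest Java priority. */
    static Thread startHeartbeatThread(HeartbeatSender sender) {
        Thread t = new Thread(sender, "heartbeat-sender");
        t.setDaemon(true);                  // must not keep the JVM alive
        t.setPriority(Thread.MAX_PRIORITY); // a scheduling hint, not a guarantee
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        HeartbeatSender sender = new HeartbeatSender(50);
        Thread t = startHeartbeatThread(sender);
        Thread.sleep(300); // stands in for the regular driver/jobmanager work
        t.interrupt();
        t.join();
        System.out.println("priority=" + t.getPriority()
                + ", beats sent=" + sender.beatsSent());
    }
}
```

Two caveats apply to this sketch. Java thread priorities are only hints to the scheduler, and how far they are honored depends on the JVM and the operating system. A higher priority also cannot help during a stop-the-world garbage collection pause, because such a pause halts all application threads, including the heartbeat thread.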

Cheers,
Sebastian

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Stephan 
Ewen
Sent: Tuesday, 18 November 2014 10:57
To: [email protected]
Subject: Re: Heartbeat lost

Yes, that sounds like a good idea.

I have seen this occasionally before, under high parallelism and with 
algorithms where the task manager had long garbage collection stalls...

The default timeout (30 seconds) can be too aggressive for such jobs...

Stephan
On 18.11.2014 09:47, "Kruse, Sebastian" <[email protected]> wrote:

> Hi everyone,
>
> In some of my jobs, I occasionally encounter the problem that some of 
> the task managers lose the heartbeat connection to the job manager. 
> The jobmanager did not crash, though. Here is an excerpt from the dashboard:
>
> Error: java.lang.Exception: TaskManager lost heartbeat connection to JobManager
>   at org.apache.flink.runtime.taskmanager.TaskManager.registerAndRunHeartbeatLoop(TaskManager.java:847)
>   at org.apache.flink.runtime.taskmanager.TaskManager.access$000(TaskManager.java:109)
>   at org.apache.flink.runtime.taskmanager.TaskManager$1.run(TaskManager.java:365)
>
> I am not sure if this is a bug. I rather suspect that the network or 
> jobmanager load is too high, so that the heartbeats do not arrive 
> (on time), but that is a mere guess. A first step for me could 
> be to increase the heartbeat interval.
>
> Has any of you encountered this problem, or do you have any ideas on 
> how to avoid it?
>
> Thanks,
> Sebastian
>
