To me, it looks like the "jobmanager.max-heartbeat-delay-before-failure.sec" is
only used by the jobmanager to determine dead taskmanagers, but not vice versa.
This is probably fine, because the parameter starts with "jobmanager". However,
the number of missed heartbeats from the jobmanager to the taskmanager seems to
be hard-wired to 3:
TaskManager, ll.335ff.:
    // start the heart beats
    {
        final long interval = GlobalConfiguration.getInteger(
            ConfigConstants.TASK_MANAGER_HEARTBEAT_INTERVAL_KEY,
            ConfigConstants.DEFAULT_TASK_MANAGER_HEARTBEAT_INTERVAL);

        this.heartbeatThread = new Thread() {
            @Override
            public void run() {
                registerAndRunHeartbeatLoop(interval, MAX_LOST_HEART_BEATS);
            }
        };
        this.heartbeatThread.setName("Heartbeat Thread");
        this.heartbeatThread.start();
    }
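
For illustration, the tolerated number of missed heartbeats could be derived
from a configurable maximum delay rather than from a constant. This is only a
rough sketch and not the existing TaskManager code: the class and method names
are made up, a plain Map stands in for GlobalConfiguration, and the
millisecond-valued taskmanager key is an assumption that does not exist today.

```java
import java.util.Map;

public class HeartbeatConfigSketch {

    // Hypothetical key; not an existing configuration constant.
    static final String MAX_DELAY_KEY =
        "taskmanager.max-heartbeat-delay-before-failure.msec";

    // Mirrors the value that appears to be hard-wired at the moment.
    static final int DEFAULT_MAX_LOST_HEART_BEATS = 3;

    /**
     * Converts a maximum tolerated delay (milliseconds) into a number of
     * heartbeats that may be lost before the connection is considered dead.
     * Falls back to the default if the key is absent or the interval is not
     * positive. A non-numeric value raises NumberFormatException.
     */
    static int maxLostHeartbeats(Map<String, String> config, long intervalMillis) {
        String raw = config.get(MAX_DELAY_KEY);
        if (raw == null || intervalMillis <= 0) {
            return DEFAULT_MAX_LOST_HEART_BEATS;
        }
        long maxDelayMillis = Long.parseLong(raw.trim());
        // Round up so that a partially elapsed interval still counts.
        long lost = (maxDelayMillis + intervalMillis - 1) / intervalMillis;
        // Tolerate at least one lost heartbeat and stay within int range.
        return (int) Math.max(1L, Math.min((long) Integer.MAX_VALUE, lost));
    }
}
```

The result would then be passed to registerAndRunHeartbeatLoop in place of
MAX_LOST_HEART_BEATS. Rounding up is a design choice on my side; rounding down
would make the taskmanager give up slightly earlier.
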
Maybe we should have a
"taskmanager.max-heartbeat-delay-before-failure.msec" setting as well.
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Stephan
Ewen
Sent: Tuesday, 18 November 2014 14:08
To: [email protected]
Subject: Re: Heartbeat lost
The heartbeats currently go through the RPC service which is soon to be
replaced by akka. So any fix there would be temporary.
You can try increasing the thread priority, let us know if it works.
Otherwise you can increase the heartbeat timeout via
"jobmanager.max-heartbeat-delay-before-failure.sec". CAREFUL: The key says
seconds, but the value is in milliseconds. We actually need to fix that.
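
A minimal sketch of the thread-priority idea, in case it helps with trying it
out. The class and method names are hypothetical and this is not the actual
TaskManager code; it only shows the standard Thread.setPriority call. Note that
the priority is merely a hint to the scheduler, so it may have little or no
effect depending on the JVM and operating system.

```java
public class HeartbeatPrioritySketch {

    /**
     * Wraps the given heartbeat loop in a named thread that asks for the
     * highest priority. The caller still has to start the thread.
     */
    static Thread newHeartbeatThread(Runnable heartbeatLoop) {
        Thread thread = new Thread(heartbeatLoop, "Heartbeat Thread");
        // The effective priority is capped by the thread group's maximum.
        thread.setPriority(Thread.MAX_PRIORITY);
        return thread;
    }
}
```
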
Stephan
On Tue, Nov 18, 2014 at 1:25 PM, Kruse, Sebastian <[email protected]>
wrote:
> I am using the RemoteCollectorOutputFormat (if you recall, Fabian
> Tschirschnitz contributed this) to send the output data to the driver
> which happens to run on the same machine as the jobmanager. In some
> cases, this output becomes huge, I assume this to be the problem.
>
> However, since the heartbeat runs in its own thread, we could assign
> it a higher priority than regular driver/jobmanager code, to avoid the
> suppression of heartbeats. Or am I missing something?
>
> Cheers,
> Sebastian
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf
> Of Stephan Ewen
> Sent: Tuesday, 18 November 2014 10:57
> To: [email protected]
> Subject: Re: Heartbeat lost
>
> Yes, that sounds like a good idea.
>
> I have experienced that occasionally before, under high parallelism
> and algorithms where the task manager got long garbage collection stalls...
>
> The default timeout (30 seconds) can be aggressive for such jobs...
>
> Stephan
> On 18.11.2014 09:47, "Kruse, Sebastian" <[email protected]> wrote:
>
> > Hi everyone,
> >
> > In some of my jobs, I occasionally encounter the problem that some
> > of the task managers lose the heartbeat connection to the job manager.
> > The jobmanager did not crash, though. Here is an excerpt from the dashboard:
> >
> > Error: java.lang.Exception: TaskManager lost heartbeat connection to JobManager
> >     at org.apache.flink.runtime.taskmanager.TaskManager.registerAndRunHeartbeatLoop(TaskManager.java:847)
> >     at org.apache.flink.runtime.taskmanager.TaskManager.access$000(TaskManager.java:109)
> >     at org.apache.flink.runtime.taskmanager.TaskManager$1.run(TaskManager.java:365)
> >
> > I am not sure if this is a bug. I rather figure that the network or
> > jobmanager workload is too high, so that somehow the heartbeats do
> > not arrive (on time), but that's a mere guess. A first step for me
> > could be to increase the heartbeat interval.
> >
> > Have any of you encountered this problem, or do you have any ideas
> > on how to avoid this issue?
> >
> > Thanks,
> > Sebastian
> >
>