Re: Heartbeat lost

Flavio Pompermaier Tue, 18 Nov 2014 08:28:08 -0800

Have you considered adopting Reactor instead of Akka?
On Nov 18, 2014 10:57 AM, "Stephan Ewen" <[email protected]> wrote:


> Yes, that sounds like a good idea.
>
> I have seen that occasionally before, with high parallelism and
> algorithms where the task managers hit long garbage collection stalls.
>
> The default timeout (30 seconds) can be aggressive for such jobs.
>
> Stephan
> On 18.11.2014 09:47, "Kruse, Sebastian" <[email protected]> wrote:
>
> > Hi everyone,
> >
> > In some of my jobs, I occasionally run into the problem that some of the
> > task managers lose the heartbeat connection to the job manager. The
> > job manager did not crash, though. Here is an excerpt from the dashboard:
> >
> > Error: java.lang.Exception: TaskManager lost heartbeat connection to JobManager
> >   at org.apache.flink.runtime.taskmanager.TaskManager.registerAndRunHeartbeatLoop(TaskManager.java:847)
> >   at org.apache.flink.runtime.taskmanager.TaskManager.access$000(TaskManager.java:109)
> >   at org.apache.flink.runtime.taskmanager.TaskManager$1.run(TaskManager.java:365)
> >
> > I am not sure whether this is a bug. I suspect that the network or
> > job manager load is too high, so that the heartbeats do not arrive
> > (on time), but that is a mere guess. A first step for me could be to
> > increase the heartbeat interval.
> >
> > Has any of you encountered this problem, or do you have any ideas on how
> > to avoid this issue?
> >
> > Thanks,
> > Sebastian
> >
>
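
For illustration, the failure mode discussed in the quoted thread (a stall
that lasts longer than the heartbeat timeout) can be sketched roughly as
follows. This is a minimal, hypothetical monitor and not Flink's actual
TaskManager code: the class HeartbeatMonitor and its methods are invented
names, and the 30-second value is simply the default timeout mentioned above.

```java
import java.time.Duration;

/** Hypothetical heartbeat monitor; an assumption for illustration, not Flink code. */
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private long lastHeartbeatMillis;

    HeartbeatMonitor(Duration timeout, long nowMillis) {
        this.timeoutMillis = timeout.toMillis();
        this.lastHeartbeatMillis = nowMillis;
    }

    /** Records that a heartbeat arrived at the given time. */
    void onHeartbeat(long nowMillis) {
        lastHeartbeatMillis = nowMillis;
    }

    /** Returns true if the silence since the last heartbeat exceeds the timeout. */
    boolean isLost(long nowMillis) {
        return nowMillis - lastHeartbeatMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(Duration.ofSeconds(30), 0L);
        monitor.onHeartbeat(5_000L);
        // 25 s of silence: still within the 30 s window.
        System.out.println(monitor.isLost(30_000L));
        // 35 s of silence, e.g. after a long garbage collection stall.
        System.out.println(monitor.isLost(40_000L));
    }
}
```

With a larger timeout (say 60 seconds), the same 35-second stall would no
longer be reported as a lost heartbeat, at the cost of detecting real
failures later.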
