Have you evaluated adopting Reactor instead of Akka?

On Nov 18, 2014 10:57 AM, "Stephan Ewen" <[email protected]> wrote:
> Yes, that sounds like a good idea.
>
> I have experienced that occasionally before, under high parallelism and
> with algorithms where the task manager had long garbage collection stalls...
>
> The default timeout (30 seconds) can be aggressive for such jobs...
>
> Stephan
>
> On 18.11.2014 09:47, "Kruse, Sebastian" <[email protected]> wrote:
>
> > Hi everyone,
> >
> > In some of my jobs, I occasionally encounter the problem that some of the
> > task managers lose the heartbeat connection to the job manager. The
> > jobmanager did not crash, though. Here is an excerpt from the dashboard:
> >
> > Error: java.lang.Exception: TaskManager lost heartbeat connection to JobManager
> >     at org.apache.flink.runtime.taskmanager.TaskManager.registerAndRunHeartbeatLoop(TaskManager.java:847)
> >     at org.apache.flink.runtime.taskmanager.TaskManager.access$000(TaskManager.java:109)
> >     at org.apache.flink.runtime.taskmanager.TaskManager$1.run(TaskManager.java:365)
> >
> > I am not sure if this is a bug. I rather suspect that the network or
> > jobmanager workload is too high, so that the heartbeats do not
> > arrive (on time), but that's a mere guess. A first step for me could be to
> > increase the heartbeat interval.
> >
> > Has any of you encountered this problem, or do you have any ideas on how
> > to avoid this issue?
> >
> > Thanks,
> > Sebastian
