Hi everyone,

Since Flink 1.5 we have the same heartbeat timeout and interval default
values that are defined as heartbeat.timeout: 50s and heartbeat.interval:
10s. These values were mainly chosen to compensate for lengthy GC pauses
and blocking operations that were executed in the main threads of Flink's
components. Since then, there were quite some advancements wrt the JVM's
GCs and we also got rid of a lot of blocking calls that were executed in
the main thread. Moreover, a long heartbeat.timeout causes long recovery
times in case of a TaskManager loss because the system can only properly
recover after the dead TaskManager has been removed from the scheduler.
Hence, I wanted to propose to change the timeout and interval to:

heartbeat.timeout: 15s
heartbeat.interval: 3s

Since there is no perfect solution that fits all use cases, I would really
like to hear from you what you think about it and how you configure these
heartbeat options. Based on your experience we might actually come up with
better default values that allow us to be resilient but also to detect
failed components fast. FLIP-185 can be found here [1].

[1] https://cwiki.apache.org/confluence/x/GAoBCw

Cheers,
Till

Reply via email to