Hi everyone,

Since Flink 1.5, the default heartbeat timeout and interval have been unchanged: heartbeat.timeout: 50s and heartbeat.interval: 10s. These values were mainly chosen to compensate for lengthy GC pauses and for blocking operations that were executed in the main threads of Flink's components. Since then, the JVM's garbage collectors have improved considerably, and we have also removed many of the blocking calls that ran in the main thread. Moreover, a long heartbeat.timeout causes long recovery times when a TaskManager is lost, because the system can only properly recover after the dead TaskManager has been removed from the scheduler.

Hence, I would like to propose changing the timeout and interval to:

heartbeat.timeout: 15s
heartbeat.interval: 3s

Since there is no perfect solution that fits all use cases, I would really like to hear what you think about this proposal and how you configure these heartbeat options. Based on your experience, we might come up with better default values that keep us resilient while also detecting failed components quickly.

FLIP-185 can be found here [1].

[1] https://cwiki.apache.org/confluence/x/GAoBCw

Cheers,
Till
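
P.S. In case anyone wants to try the proposed values on a cluster before any default changes, here is a minimal sketch of the corresponding flink-conf.yaml entries. It assumes that both options are read as plain numbers in milliseconds, which is my understanding of the current releases. Please check the configuration docs of the version you run, since the "15s" and "3s" notation used in this mail is shorthand and may not be accepted as-is.

```yaml
# flink-conf.yaml (sketch; values assumed to be interpreted as milliseconds)
heartbeat.timeout: 15000   # proposed 15s (current default: 50s)
heartbeat.interval: 3000   # proposed 3s (current default: 10s)
```

Note that the proposal keeps the ratio between timeout and interval at 5, the same as with the current defaults (50s / 10s), so only the absolute detection time changes.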