[ https://issues.apache.org/jira/browse/FLINK-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15436466#comment-15436466 ]
ASF GitHub Bot commented on FLINK-4449:
---------------------------------------

Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/2410

Hi @beyond1920,

first let me comment on the different points you've raised.

1. I'm actually not sure whether the exponential backoff strategy makes sense. The `maxHeartbeatTimeout` won't solve the underlying problem, as your timeout can then be up to `2*maxHeartbeatTimeout`. I guess the best approach would be to have a periodic heartbeat interval which is adaptive with respect to load. But I'm not sure here.

2. It's correct that the `HeartbeatManager` starts the `HeartbeatScheduler`. However, from then on, the `HeartbeatScheduler` instances are responsible for the triggering. In order to control the heartbeat load, you would need to coordinate it on the `HeartbeatManager` level.

3. Sure, but the question is whether the RM, TM and JM have to implement an interface for it (adding it to their RPC contract) or whether we have an independent component responsible for the heartbeating. Furthermore, there is a difference between replying to an RPC call (giving the result back as a future) and sending an RPC of your own back to the sender. The former has the problem that timed-out futures will be ignored, and thus all the work you've done on the receiver side has to be redone for the next heartbeat trigger message.

4. I'm actually not so sure whether the components are so different. Logically, the receiving and sending side should have the same heartbeat timeout detection logic. The only difference should be that the sender initiates the heartbeat and the receiver answers it. One could also think about having a heartbeat sender instance on each side. Anyway, I think that the receiving and sending side should be developed together and not considered separate parts.

I think we should first concentrate on these questions to agree on a design for the heartbeat component. This should happen in the JIRA issue.

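
As an illustration of point 4, a minimal sketch of timeout detection logic that both the heartbeat sender and the heartbeat receiver could share might look as follows. All names here (`HeartbeatMonitor`, `monitorTarget`, `reportHeartbeat`, `timedOutTargets`) are hypothetical and are not taken from the PR or from Flink; the injected clock is an assumption made only to keep the sketch testable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch, not Flink's API: timeout detection that is identical
 * on the sending and the receiving side of a heartbeat.
 * Not thread-safe; a real component would need synchronization.
 */
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final LongSupplier clock; // assumed injectable time source (milliseconds)
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis, LongSupplier clock) {
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
    }

    /** Starts monitoring a target; its timeout window opens at the current time. */
    void monitorTarget(String targetId) {
        lastHeartbeat.put(targetId, clock.getAsLong());
    }

    /**
     * Records a heartbeat from a monitored target: a response on the sender
     * side, a request on the receiver side. Unknown targets are ignored.
     */
    void reportHeartbeat(String targetId) {
        lastHeartbeat.computeIfPresent(targetId, (id, previous) -> clock.getAsLong());
    }

    /** Returns all targets whose last heartbeat is older than the timeout. */
    List<String> timedOutTargets() {
        long now = clock.getAsLong();
        List<String> timedOut = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeat.entrySet()) {
            if (now - entry.getValue() > timeoutMillis) {
                timedOut.add(entry.getKey());
            }
        }
        return timedOut;
    }
}
```

Under this sketch the two sides would differ only in who initiates the exchange: the sender would call `reportHeartbeat` when a response arrives, the receiver when a request arrives, and both would poll `timedOutTargets` to detect a dead peer.
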
Furthermore, I think you still have to clean up the PR because it contains unrelated changes. So it would be best if you close it and we concentrate on the design first. Once this is done, you can open a new PR with the implementation.

> Heartbeat Manager between ResourceManager and TaskExecutor
> ----------------------------------------------------------
>
>                 Key: FLINK-4449
>                 URL: https://issues.apache.org/jira/browse/FLINK-4449
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Cluster Management
>            Reporter: zhangjing
>            Assignee: zhangjing
>
> HeartbeatManager is responsible for heartbeat between resourceManager to TaskExecutor
> 1. Register taskExecutors
> register heartbeat targets. If the heartbeat response for these targets is not reported in time, mark target failed and notify resourceManager
> 2. trigger heartbeat
> trigger heartbeat from resourceManager to TaskExecutor periodically
> taskExecutor report slot allocation in the heartbeat response
> ResourceManager sync self slot allocation with the heartbeat response

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)