[ 
https://issues.apache.org/jira/browse/FLINK-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15436466#comment-15436466
 ] 

ASF GitHub Bot commented on FLINK-4449:
---------------------------------------

Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/2410
  
    Hi @beyond1920,
    
    first let me comment on the different points you've raised.
    
    1. I'm actually not sure whether the exponential backoff strategy makes 
sense. The `maxHeartbeatTimeout` won't solve the underlying problem, as your 
timeout can then be up to `2*maxHeartbeatTimeout`. I guess the best approach 
would be to have a periodic heartbeat interval which adapts to the load. But 
I'm not sure here.
    
    2. It's correct that the `HeartbeatManager` starts the 
`HeartbeatScheduler`s. However, from then on, the `HeartbeatScheduler`s are 
responsible for the triggering. In order to control the heartbeat load, you 
would need to coordinate it on the `HeartbeatManager` level.
    
    3. Sure, but the question is whether the RM, TM and JM have to implement an 
interface for it (adding it to their RPC contract) or whether we have an 
independent component responsible for the heartbeating. Furthermore, there is 
a difference between replying to an RPC call (giving the result back as a 
future) and sending an RPC of your own back to the sender. The former has the 
problem that timed-out futures will be ignored, and thus all the work you've 
done on the receiver side has to be redone for the next heartbeat trigger 
message.
    
    4. I'm actually not so sure whether the components are so different. 
Logically, the receiving and sending sides should have the same heartbeat 
timeout detection logic. The only difference should be that the sender 
initiates the heartbeat and the receiver answers it. One could also think about 
having a heartbeat sender instance on each side. Anyway, I think that the 
receiving and sending sides should be developed together and not considered 
separate parts.
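
    To make the difference in point 3 concrete, here is a minimal sketch. It 
is only an illustration, not Flink's RPC layer: `HeartbeatReplyStyles`, 
`Sender` and `heartbeatResponse` are hypothetical names, and the timeout is 
simulated by failing the future by hand.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names exist in Flink.
final class HeartbeatReplyStyles {

    // Variant A: the heartbeat answer is the result of the RPC call (a future).
    // Once the sender has timed the future out, a late answer is dropped.
    static boolean lateReplyAcceptedByFuture(String payload) {
        CompletableFuture<String> reply = new CompletableFuture<>();
        // Simulate the sender-side timeout firing before the answer arrives.
        reply.completeExceptionally(new TimeoutException("heartbeat reply timed out"));
        // complete() returns false because the future is already completed.
        return reply.complete(payload);
    }

    // Variant B: the receiver sends its own message back to the sender.
    // The sender processes it whenever it arrives, even if it is late.
    static final class Sender {
        private String lastPayload;

        void heartbeatResponse(String payload) {
            lastPayload = payload;
        }

        String lastPayload() {
            return lastPayload;
        }
    }
}
```

    In variant A the payload computed by the receiver is lost after a timeout, 
whereas in variant B it still reaches the sender's state.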
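
    The shared timeout detection logic from point 4 could look roughly like 
the following sketch. `HeartbeatMonitor` and its methods are hypothetical 
names chosen for illustration, and time is passed in explicitly as 
milliseconds so that the same class can be used, and tested, on either side.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the same monitor is usable on the sending and the receiving side.
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeenMillis = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeout must be positive");
        }
        this.timeoutMillis = timeoutMillis;
    }

    // Start monitoring a target; registration counts as the first sign of life.
    void monitorTarget(String targetId, long nowMillis) {
        lastSeenMillis.put(targetId, nowMillis);
    }

    // Record a heartbeat (request or response) from a monitored target.
    // Heartbeats from unknown targets are ignored.
    void reportHeartbeat(String targetId, long nowMillis) {
        lastSeenMillis.computeIfPresent(targetId, (id, previous) -> nowMillis);
    }

    // Return all targets whose last heartbeat is older than the timeout.
    List<String> timedOutTargets(long nowMillis) {
        List<String> timedOut = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeenMillis.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                timedOut.add(entry.getKey());
            }
        }
        return timedOut;
    }
}
```

    With such a class, the only side-specific part left would be who initiates 
the heartbeat; both sides call `reportHeartbeat` on arrival and periodically 
check `timedOutTargets`.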
    
    I think we should first concentrate on these questions to agree on a design 
for the heartbeat component. This should happen in the JIRA issue.
    
    Furthermore, I think you still have to clean up the PR because it contains 
unrelated changes. It would be best to close it so that we can concentrate on 
the design first. Once this is done, you can open a new PR with the 
implementation.


> Heartbeat Manager between ResourceManager and TaskExecutor
> ----------------------------------------------------------
>
>                 Key: FLINK-4449
>                 URL: https://issues.apache.org/jira/browse/FLINK-4449
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Cluster Management
>            Reporter: zhangjing
>            Assignee: zhangjing
>
> HeartbeatManager is responsible for the heartbeat between the ResourceManager 
> and the TaskExecutors.
> 1. Register TaskExecutors
> Register heartbeat targets. If the heartbeat response for a target is not 
> reported in time, mark the target as failed and notify the ResourceManager.
> 2. Trigger heartbeat
> Periodically trigger a heartbeat from the ResourceManager to each TaskExecutor.
> The TaskExecutor reports its slot allocation in the heartbeat response.
> The ResourceManager syncs its own slot allocation with the heartbeat response.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
