[ 
https://issues.apache.org/jira/browse/FLINK-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15436164#comment-15436164
 ] 

ASF GitHub Bot commented on FLINK-4449:
---------------------------------------

Github user beyond1920 commented on the issue:

    https://github.com/apache/flink/pull/2410
  
    Hi Till, thanks a lot for the review and the good advice. I agree that we
    should first define how it should look. Here are my thoughts on your
    questions.
    1. Exponential backoff strategy.
    In fact, it is not a full exponential backoff; it is capped, like
    'Math.min(2 * timeoutMillis, maxHeartbeatTimeout)'. Maybe we could use
    maxHeartbeatTimeout to reduce the risk of waiting twice as long as defined
    before being notified about a heartbeat failure.
    Alternatively, we could use a constant retry period instead of a backoff
    strategy.
    2. Should every heartbeat connection be responsible for triggering itself,
    or should the heartbeat manager be responsible for that?
    A heartbeat scheduler does not trigger itself; it relies on the outside
    world (here I mean the HeartbeatManager) to call its start method.
    3. Is the heartbeat receiving end an independent RpcEndpoint? How does the
    payload delivery work? Does the sender side ask for the result (future), or
    does the receiving side answer via a tell message to the heartbeat manager?
    On the sender side, the receiving end is a gateway that can be obtained by
    its address, and the sender side asks the receiver for the heartbeat
    payload.
    4. How does the receiving end monitor the sender, so that if a heartbeat
    request is not delivered, the receiving end can mark the sending end as
    dead?
    I think this could be independent of the heartbeat manager on the sending
    side. It should run on the receiving end, while the heartbeat scheduler
    runs on the sending side.
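
For point 4, a rough sketch of what a monitor on the receiving end could look like. This is only an illustration under assumptions: the class name `HeartbeatMonitor`, its methods, and the string sender id are hypothetical and not part of the pull request. The receiver records when each heartbeat request arrives and treats a sender as dead once no request has been seen within the timeout.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical receiver-side monitor; names are illustrative, not Flink API.
class HeartbeatMonitor {
    private final long timeoutMillis;
    // Timestamp (in ms) of the last heartbeat request seen per sender.
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called by the receiving end whenever a heartbeat request arrives.
    void onHeartbeatRequest(String senderId, long nowMillis) {
        lastSeen.put(senderId, nowMillis);
    }

    // A sender counts as dead if it was seen before and has been silent for
    // longer than the timeout. Senders never seen are not monitored yet.
    boolean isDead(String senderId, long nowMillis) {
        Long last = lastSeen.get(senderId);
        return last != null && nowMillis - last > timeoutMillis;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(100L);
        monitor.onHeartbeatRequest("resourceManager", 0L);
        System.out.println(monitor.isDead("resourceManager", 50L));
        System.out.println(monitor.isDead("resourceManager", 150L));
    }
}
```

Passing the current time in explicitly keeps the sketch independent of any scheduler on the sending side; a real implementation would more likely use a timer that is reset on each request.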
    
    What's your advice?
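
To make point 1 concrete, here is a minimal sketch of the capped doubling, using the expression quoted above. The class name `CappedBackoff`, the method name `nextTimeout`, and the example values (1000 ms initial timeout, 5000 ms cap) are assumptions for illustration only.

```java
// Illustrative sketch of a capped backoff; names and values are hypothetical.
class CappedBackoff {
    // Doubles the current timeout, but never beyond maxHeartbeatTimeout.
    static long nextTimeout(long timeoutMillis, long maxHeartbeatTimeout) {
        return Math.min(2 * timeoutMillis, maxHeartbeatTimeout);
    }

    public static void main(String[] args) {
        long timeoutMillis = 1000L;
        long maxHeartbeatTimeout = 5000L;
        for (int attempt = 1; attempt <= 5; attempt++) {
            System.out.println("attempt " + attempt + ": " + timeoutMillis + " ms");
            timeoutMillis = nextTimeout(timeoutMillis, maxHeartbeatTimeout);
        }
    }
}
```

With these values the timeout goes 1000, 2000, 4000, 5000, 5000 ms, so the wait never exceeds the cap. A constant retry period would correspond to returning timeoutMillis unchanged.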


> Heartbeat Manager between ResourceManager and TaskExecutor
> ----------------------------------------------------------
>
>                 Key: FLINK-4449
>                 URL: https://issues.apache.org/jira/browse/FLINK-4449
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Cluster Management
>            Reporter: zhangjing
>            Assignee: zhangjing
>
> HeartbeatManager is responsible for the heartbeat between the ResourceManager
> and the TaskExecutor.
> 1. Register TaskExecutors
> Register heartbeat targets. If the heartbeat response for a target is not
> reported in time, mark the target as failed and notify the ResourceManager.
> 2. Trigger heartbeat
> Trigger the heartbeat from the ResourceManager to the TaskExecutor
> periodically.
> The TaskExecutor reports its slot allocation in the heartbeat response.
> The ResourceManager syncs its own slot allocation with the heartbeat response.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
