[ https://issues.apache.org/jira/browse/FLINK-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333998#comment-16333998 ]

ASF GitHub Bot commented on FLINK-8462:
---------------------------------------

Github user GJL commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5318#discussion_r162867279
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java ---
    @@ -1337,11 +1340,16 @@ public void reportPayload(ResourceID resourceID, Void payload) {
                @Override
                public void notifyHeartbeatTimeout(final ResourceID resourceId) {
                        runAsync(() -> {
    -                           log.info("The heartbeat of ResourceManager with id {} timed out.", resourceId);
    +                           // first check whether the timeout is still valid
    +                           if (resourceManagerConnection != null && resourceManagerConnection.getResourceManagerId().equals(resourceId)) {
    +                                   log.info("The heartbeat of ResourceManager with id {} timed out.", resourceId);
     
    -                           closeResourceManagerConnection(
    -                                   new TimeoutException(
    -                                           "The heartbeat of ResourceManager with id " + resourceId + " timed out."));
    +                                   closeResourceManagerConnection(
    +                                           new TimeoutException(
    +                                                   "The heartbeat of ResourceManager with id " + resourceId + " timed out."));
    +                           } else {
    +                                   log.debug("Received heartbeat timeout for outdated ResourceManager connection {}. Ignoring the timeout.", resourceId);
    --- End diff --
    
    nit: *ResourceManager with id* vs *ResourceManager connection {}*
    
    Both messages log the same argument, but one calls it a ResourceManager
    id and the other calls it a ResourceManager connection.



> TaskExecutor does not verify RM heartbeat timeouts
> --------------------------------------------------
>
>                 Key: FLINK-8462
>                 URL: https://issues.apache.org/jira/browse/FLINK-8462
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Major
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> The {{TaskExecutor}} neither properly stops RM heartbeats nor checks whether
> an RM heartbeat timeout is still valid. As a consequence, the
> {{TaskExecutor}} can close the connection to an active {{RM}} because of an
> outdated heartbeat timeout.
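
The validity check described above can be sketched in isolation as follows.
This is a minimal illustration, not the Flink implementation: the class
HeartbeatTimeoutGuard, its methods, and the use of plain strings as resource
ids are all assumptions made for the example. It only shows the idea of
acting on a timeout when it refers to the currently connected ResourceManager
and ignoring it otherwise.

```java
import java.util.Objects;

/** Hypothetical stand-in for the TaskExecutor's RM connection bookkeeping (not Flink code). */
final class HeartbeatTimeoutGuard {

    // Id of the ResourceManager we are currently connected to, or null if there is none.
    private String currentResourceManagerId;

    // Counts how often a connection was actually closed because of a timeout.
    private int closedConnections;

    /** Registers a (new) ResourceManager connection, replacing any previous one. */
    void connect(String resourceManagerId) {
        this.currentResourceManagerId = Objects.requireNonNull(resourceManagerId);
    }

    /**
     * Handles a heartbeat timeout for the given id.
     *
     * @return true if the timeout matched the current connection and the connection
     *         was closed, false if the timeout was outdated and therefore ignored
     */
    boolean notifyHeartbeatTimeout(String resourceId) {
        // First check whether the timeout is still valid for the current connection.
        if (currentResourceManagerId != null && currentResourceManagerId.equals(resourceId)) {
            currentResourceManagerId = null; // close the connection
            closedConnections++;
            return true;
        }
        // Outdated timeout: it refers to a ResourceManager we are no longer connected to.
        return false;
    }

    boolean isConnected() {
        return currentResourceManagerId != null;
    }

    int closedConnections() {
        return closedConnections;
    }
}
```

With this guard, a timeout that arrives late for a previous ResourceManager
leaves the connection to the active one untouched.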



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
