[ 
https://issues.apache.org/jira/browse/FLINK-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenlf updated FLINK-10564:
---------------------------
    Description: 
It works fine until the number of tasks exceeds about 400.
There are 600+ tasks (each task handles billions of records) running in our
cluster now, and the problem is that submitting, canceling, or querying a task
takes too long (sometimes timing out).
Resources such as memory and CPU are at normal levels.

After debugging, we found that this method is the culprit:

org.apache.flink.runtime.util.LeaderRetrievalUtils.LeaderGatewayListener.notifyLeaderAddress(String, UUID)


> tm costs too much time when communicating with  jm
> --------------------------------------------------
>
>                 Key: FLINK-10564
>                 URL: https://issues.apache.org/jira/browse/FLINK-10564
>             Project: Flink
>          Issue Type: Bug
>          Components: Core, JobManager, TaskManager
>         Environment: The configs are as follows:
> jm
>   high-availability                zookeeper
>   taskmanager.heap.mb              16384
>   taskmanager.memory.preallocate   false
>   taskmanager.numberOfTaskSlots    64
> tm
>   slots                            128
>   free slots                       0-128
>   CPU cores                        40
>   physical memory                  95 GB
>   free memory                      32-50 GB
>   Flink managed memory             22-35 GB
>            Reporter: chenlf
>            Priority: Major
>         Attachments: timeout.log
>
>
> It works fine until the number of tasks exceeds about 400.
> There are 600+ tasks (each task handles billions of records) running in our
> cluster now, and the problem is that submitting, canceling, or querying a
> task takes too long (sometimes timing out).
> Resources such as memory and CPU are at normal levels.
>
> After debugging, we found that this method is the culprit:
>
> org.apache.flink.runtime.util.LeaderRetrievalUtils.LeaderGatewayListener.notifyLeaderAddress(String, UUID)



