[ https://issues.apache.org/jira/browse/FLINK-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
chenlf updated FLINK-10564:
---------------------------
    Description:
Everything works fine until the number of tasks exceeds about 400.
There are now 600+ tasks running in our cluster (each task handles billions of records), and submitting, canceling, or querying a task takes too long and sometimes times out.
Resources such as memory and CPU are at normal levels.
After debugging, we found that this method is the culprit:
org.apache.flink.runtime.util.LeaderRetrievalUtils.LeaderGatewayListener.notifyLeaderAddress(String, UUID)

  was:
Everything works fine until the number of tasks exceeds about 400.
There are now 600+ tasks running in our cluster (each task handles billions of records), and submitting, canceling, or querying a task takes too long and sometimes times out.
Resources such as memory and CPU are at normal levels.
After debugging, we found that this method is the ulprit:
org.apache.flink.runtime.util.LeaderRetrievalUtils.LeaderGatewayListener.notifyLeaderAddress(String, UUID)

> tm costs too much time when communicating with jm
> --------------------------------------------------
>
>                 Key: FLINK-10564
>                 URL: https://issues.apache.org/jira/browse/FLINK-10564
>             Project: Flink
>          Issue Type: Bug
>          Components: Core, JobManager, TaskManager
>         Environment: configs are as follows:
> jm
>   high-availability: zookeeper
>   taskmanager.heap.mb: 16384
>   taskmanager.memory.preallocate: false
>   taskmanager.numberOfTaskSlots: 64
> tm
>   slots: 128
>   free slots: 0-128
>   CPU cores: 40
>   physical memory: 95 GB
>   free memory: 32-50 GB
>   Flink managed memory: 22-35 GB
>            Reporter: chenlf
>            Priority: Major
>         Attachments: timeout.log
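
For anyone trying to reproduce the setup, the settings listed under "jm" in the Environment field might look roughly like this in flink-conf.yaml. This is only a sketch: the four keys and values are copied from the report, while the file name and the `key: value` layout are assumptions. The report does not give the ZooKeeper quorum or any other high-availability settings, so those are left out.

```yaml
# Sketch of flink-conf.yaml, assuming the reported settings live in this file.
# Values are taken from the Environment field of FLINK-10564.
high-availability: zookeeper
# Assumption: ZooKeeper quorum and HA storage settings are configured
# elsewhere; the report does not list them.
taskmanager.heap.mb: 16384
taskmanager.memory.preallocate: false
taskmanager.numberOfTaskSlots: 64
```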