Zhu Zhu created FLINK-10413:
-------------------------------
Summary: requestPartitionState messages overwhelms JM RPC main
thread
Key: FLINK-10413
URL: https://issues.apache.org/jira/browse/FLINK-10413
Project: Flink
Issue Type: Bug
Components: Distributed Coordination
Affects Versions: 1.7.0
Reporter: Zhu Zhu
We benchmarked job scheduling performance with a 2000x2000
ALL-to-ALL streaming (EAGER) job. The input data is empty, so the tasks finish
soon after they start.
In this case we see slow RPC responses, and TM/RM heartbeats to the JM
eventually time out.
We found ~2,000,000 requestPartitionState messages triggered by
triggerPartitionProducerStateCheck within a short time, which overwhelm the JM
RPC main thread. This happens because, with EAGER scheduling, downstream tasks
can be started earlier than their upstream tasks, so the partitions they
request do not exist yet.
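As a rough sanity check on the message count, the sketch below assumes that each consumer/producer pair triggers at most one check, and that a consumer starts before a given producer about half of the time. Both are assumptions for illustration, not measurements from the benchmark. The class and method names are hypothetical.

```java
/** Back-of-envelope estimate; all inputs are illustrative assumptions. */
public class PartitionCheckEstimate {

    /**
     * Expected number of producer state checks when every consumer reads
     * from every producer (ALL-to-ALL) and a consumer is deployed before a
     * given producer with probability pConsumerFirst.
     */
    public static long expectedChecks(long producers, long consumers, double pConsumerFirst) {
        return Math.round(producers * consumers * pConsumerFirst);
    }

    public static void main(String[] args) {
        // 2000 x 2000 channels, assumed 50% chance the consumer starts first.
        System.out.println(expectedChecks(2000, 2000, 0.5));
    }
}
```

Under these assumptions, about half of the 4,000,000 input channels would trigger a check, which is in line with the ~2,000,000 messages observed.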
We suggest removing the partition producer state check to avoid this issue. The
task can simply wait for a while and retry if the partition does not exist.
There are two cases when the partition does not exist:
# the partition has not been started yet
# the partition has failed
In case 1, retry works. In case 2, a task failover will soon happen and cancel
the downstream tasks as well.
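The wait-and-retry idea could look roughly like the following. This is a minimal sketch, not Flink code: the class name, the method requestWithBackoff, the lookup supplier, and the backoff parameters are all hypothetical, and the lookup stands in for whatever call requests the remote partition.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical sketch: retry a partition request locally instead of asking the JM. */
public class PartitionRequestRetry {

    /**
     * Calls lookup until it returns a value or maxAttempts is exhausted,
     * doubling the wait between attempts up to maxBackoffMs.
     * Returns empty if the partition never appeared or the thread was interrupted
     * (for example, because a failover cancelled the task).
     */
    public static <T> Optional<T> requestWithBackoff(
            Supplier<Optional<T>> lookup,
            int maxAttempts,
            long initialBackoffMs,
            long maxBackoffMs) {
        long backoff = initialBackoffMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<T> result = lookup.get();
            if (result.isPresent()) {
                return result; // case 1: the producer has started in the meantime
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoff);
                } catch (InterruptedException e) {
                    // case 2: treat interruption as cancellation and stop retrying
                    Thread.currentThread().interrupt();
                    return Optional.empty();
                }
                backoff = Math.min(backoff * 2, maxBackoffMs);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Simulated producer whose partition becomes available on the third lookup.
        int[] calls = {0};
        Supplier<Optional<String>> lookup =
                () -> ++calls[0] >= 3 ? Optional.of("partition-0") : Optional.<String>empty();
        Optional<String> partition = requestWithBackoff(lookup, 5, 10, 100);
        System.out.println(partition.orElse("not found") + " after " + calls[0] + " lookups");
    }
}
```

With this approach a missing partition costs only local waiting on the consumer side and no message to the JM. The trade-off is that a failed producer is detected through the regular failover path rather than through an explicit state check.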
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)