[
https://issues.apache.org/jira/browse/FLINK-20612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251062#comment-17251062
]
Zhu Zhu commented on FLINK-20612:
---------------------------------
I have assigned you the ticket [~Thesharing]. Feel free to open a PR.
> Add benchmarks for scheduler
> ----------------------------
>
> Key: FLINK-20612
> URL: https://issues.apache.org/jira/browse/FLINK-20612
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.13.0
> Reporter: Zhilong Hong
> Priority: Major
>
> With Flink 1.12, we failed to run large-scale jobs on our cluster. When we
> were trying to run the jobs, we met the exceptions like out of heap memory,
> taskmanager heartbeat timeout, and etc. We increased the size of heap memory
> and extended the heartbeat timeout, the job still failed. After the
> troubleshooting, we found that there are some performance bottlenecks in the
> jobmaster. These bottlenecks are highly related to the complexity of the
> topology.
> We implemented several benchmarks on these bottlenecks based on
> flink-benchmark. The topology of the benchmarks is a simple graph, which
> consists of only two vertices: one source vertex and one sink vertex. They
> are both connected with all-to-all blocking edges. The parallelisms of the
> vertices are both 8000. The execution mode is batch. The results of the
> benchmarks are illustrated below:
> Table 1: The result of benchmarks on bottlenecks in the jobmaster
> | |*Time spent*|
> |Build topology|19970.44 ms|
> |Init scheduling strategy|38167.351 ms|
> |Deploy tasks|15102.850 ms|
> |Calculate failover region to restart|12080.271 ms|
> We'd like to propose these benchmarks for procedures related to the
> scheduler. There are three main benefits:
> # They help us to understand the current status of task deployment
> performance and locate where the bottleneck is.
> # We can use the benchmarks to evaluate the optimization in the future.
> # As we run the benchmarks daily, they will help us to trace how the
> performance changes and locate the commit that introduces the performance
> regression if there is any.
> In the first version of the benchmarks, we mainly focus on the procedures we
> mentioned above. The methods corresponding to the procedures are:
> # Building topology: {{ExecutionGraph#attachJobGraph}}
> # Initializing scheduling strategies:
> {{PipelinedRegionSchedulingStrategy#init}}
> # Deploying tasks: {{Execution#deploy}}
> # Calculating failover regions:
> {{RestartPipelinedRegionFailoverStrategy#getTasksNeedingRestart}}
> In the benchmarks, the topology consists of two vertices: source -> sink.
> They are connected with all-to-all edges. The result partition type
> ({{PIPELINED}} and {{BLOCKING}}) should be considered separately.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)