[
https://issues.apache.org/jira/browse/FLINK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283767#comment-17283767
]
Zhilong Hong commented on FLINK-10320:
--------------------------------------
Thanks for the reminder, [~chesnay] and [~pnowojski].
I think the current scheduler benchmark already covers goal #1 (i.e. "How fast
the job switch into RUNNING, or say could we start the job faster"). It
involves several procedures with high computational complexity. In fact, after
the optimization we made in FLINK-21110, the main bottleneck in deploying a job
lies in computing the pipelined regions. We will try to optimize it in the
future.
For goal #2 (i.e. "What the throughput JobMaster reacts rpc requests"), we
are still thinking about it. My first concern is that the RPC we mock locally
differs from the real situation in two ways. First, we cannot simulate
network latency. I think this may greatly impact RPC performance if the
communication saturates the available bandwidth (in the worst case). Second,
the thread model is totally different. Currently, the future executor in
JobMaster has a thread pool that uses all the CPU cores on the machine. If we
start threads to simulate TaskExecutors on the same machine, the mocked TMs
may affect the performance of JobMaster. For example,
{{Execution#submitTask}} runs on the future executor, while
{{TaskExecutor#submitTask}} runs on the main thread of the TaskExecutor.
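To make the thread-model concern concrete, here is a minimal sketch in plain
Java. It is not Flink code: {{ThreadModelSketch}}, {{MockTaskExecutor}} and
{{deployAll}} are hypothetical names used only for illustration. It assumes a
"future executor" pool sized to all CPU cores and mocked TaskExecutors that
each acknowledge submissions on a single main thread, all within one JVM, so
both sides compete for the same cores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThreadModelSketch {

    /** Hypothetical stand-in for a mocked TaskExecutor with one main thread. */
    static final class MockTaskExecutor implements AutoCloseable {
        private final ExecutorService mainThread = Executors.newSingleThreadExecutor();

        /** Acknowledges the submission on this mock's own main thread. */
        CompletableFuture<String> submitTask(int taskId) {
            return CompletableFuture.supplyAsync(() -> "ACK-" + taskId, mainThread);
        }

        @Override
        public void close() {
            mainThread.shutdownNow();
        }
    }

    /**
     * Submits numTasks from a pool that uses all CPU cores (assumed to mirror
     * the JobMaster's future executor) to mocked TaskExecutors running in the
     * same JVM, and returns how many submissions were acknowledged.
     */
    static int deployAll(int numTasks, int numTaskExecutors) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService futureExecutor = Executors.newFixedThreadPool(cores);
        List<MockTaskExecutor> taskExecutors = new ArrayList<>();
        for (int i = 0; i < numTaskExecutors; i++) {
            taskExecutors.add(new MockTaskExecutor());
        }
        try {
            List<CompletableFuture<String>> acks = new ArrayList<>();
            for (int i = 0; i < numTasks; i++) {
                final int taskId = i;
                final MockTaskExecutor tm = taskExecutors.get(i % numTaskExecutors);
                // The "RPC" is issued on the future executor and completed on
                // the mock's main thread; both share this machine's cores.
                acks.add(CompletableFuture
                        .supplyAsync(() -> tm.submitTask(taskId), futureExecutor)
                        .thenCompose(ack -> ack));
            }
            int acknowledged = 0;
            for (CompletableFuture<String> ack : acks) {
                if (ack.get(10, TimeUnit.SECONDS).startsWith("ACK-")) {
                    acknowledged++;
                }
            }
            return acknowledged;
        } finally {
            futureExecutor.shutdownNow();
            for (MockTaskExecutor tm : taskExecutors) {
                tm.close();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        int acknowledged = deployAll(1000, 4);
        long elapsedMicros = TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - start);
        System.out.println(acknowledged + " submissions acknowledged in "
                + elapsedMicros + " us");
    }
}
```

Any timing taken this way includes the CPU time consumed by the mock threads
themselves, and it contains no network latency at all, which are the two
differences from a real deployment described above.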
> Introduce JobMaster schedule micro-benchmark
> --------------------------------------------
>
> Key: FLINK-10320
> URL: https://issues.apache.org/jira/browse/FLINK-10320
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination, Tests
> Reporter: Zili Chen
> Priority: Major
>
> Based on {{org.apache.flink.streaming.runtime.io.benchmark}} stuff and the
> repo [flink-benchmark|https://github.com/dataArtisans/flink-benchmarks], I
> propose to introduce another micro-benchmark which focuses on {{JobMaster}}
> scheduling performance.
> h3. Target
> Benchmark how long it takes from {{JobMaster}} startup (receiving the
> {{JobGraph}} and initializing) until all tasks are RUNNING. Technically, we
> use a bounded stream and the TM finishes tasks as soon as they arrive, so the
> interval we actually measure ends when all tasks are FINISHED.
> h3. Case
> 1. JobGraph that covers EAGER + PIPELINED edges
> 2. JobGraph that covers LAZY_FROM_SOURCES + PIPELINED edges
> 3. JobGraph that covers LAZY_FROM_SOURCES + BLOCKING edges
> PS: maybe also benchmark the case where the source is obtained from
> {{InputSplit}}?
> h3. Implement
> Based on the flink-benchmark repo, we ultimately run the benchmark using JMH,
> so the whole test suite is split across two repos. The testing environment
> could be located in the main repo, maybe under
> flink-runtime/src/test/java/org/apache/flink/runtime/jobmaster/benchmark.
> To measure the performance of {{JobMaster}} scheduling, we need to simulate
> an environment that:
> 1. has a real {{JobMaster}}
> 2. has a mock/testing {{ResourceManager}} that has infinite resources and
> reacts immediately.
> 3. has one (or many?) mock/testing {{TaskExecutor}} that deploys and finishes
> tasks immediately.
> [[email protected]] [~GJL] [~pnowojski] could you please review this
> proposal to help clarify the goal and concrete details? Thanks in advance.
> Any suggestions are welcome.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)