[ 
https://issues.apache.org/jira/browse/FLINK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16613148#comment-16613148
 ] 

陈梓立 commented on FLINK-10320:
-----------------------------

On reconsideration, when it comes to scheduling there are two main topics we 
are concerned with.
 # Correctness, that is, {{JobMaster}} is able to schedule the 
{{ExecutionGraph}}, switch it into RUNNING and optionally FINISHED while 
tolerating failures. This is out of scope for this thread.
 # Performance, which represents {{JobMaster}}'s ability to react to rpc 
requests during resource requesting, task deployment and execution graph state 
maintenance.

For the performance part, there are two targets we are interested in.
 # How fast the job switches into RUNNING, in other words, whether we could 
start the job faster. See FLINK-10038.
 # The throughput at which {{JobMaster}} reacts to rpc requests.

The latter target belongs in another thread that discusses how to monitor and 
collect metrics for the rpc service; this thread is about the former.

By offering slots as soon as slot requests arrive and finishing tasks 
immediately, we eliminate the influence of time spent in components other than 
the JM. So we measure for a given parallelism and {{JobGraph}}. NOTE THAT with 
the draft above I am not aiming to provide a score that shows a change improves 
scheduling performance, but to provide a regression sentinel that alerts us if 
a relevant change causes a scheduling regression.
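
To make the idea concrete, here is a minimal, self-contained sketch of the measurement and the sentinel check. It is only an illustration under assumptions: {{InstantResourceManager}}, {{InstantTaskExecutor}}, {{scheduleAll}} and {{isRegression}} are hypothetical names made up for this example, not Flink classes, and the toy scheduler merely stands in for a real {{JobMaster}}. The tolerance value is also an assumption that would need tuning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch; none of these types are Flink classes. */
final class ScheduleSentinelSketch {

    enum State { DEPLOYING, FINISHED }

    /** Stub resource manager: infinite resources, answers every slot request at once. */
    static final class InstantResourceManager {
        private int nextSlot = 0;

        CompletableFuture<Integer> requestSlot() {
            return CompletableFuture.completedFuture(nextSlot++);
        }
    }

    /** Stub task executor: a deployed task finishes immediately. */
    static final class InstantTaskExecutor {
        CompletableFuture<State> deploy(int slot) {
            return CompletableFuture.completedFuture(State.FINISHED);
        }
    }

    /** Toy scheduler standing in for the component under test; returns the number of FINISHED tasks. */
    static int scheduleAll(int parallelism, InstantResourceManager rm, InstantTaskExecutor te) {
        List<CompletableFuture<State>> tasks = new ArrayList<>(parallelism);
        for (int i = 0; i < parallelism; i++) {
            // Request a slot, then deploy the task into it as soon as the slot is offered.
            tasks.add(rm.requestSlot().thenCompose(slot -> te.deploy(slot)));
        }
        int finished = 0;
        for (CompletableFuture<State> task : tasks) {
            if (task.join() == State.FINISHED) {
                finished++;
            }
        }
        return finished;
    }

    /** Measures the interval from scheduling start until all tasks are FINISHED. */
    static long measureNanos(int parallelism) {
        long start = System.nanoTime();
        int finished = scheduleAll(parallelism, new InstantResourceManager(), new InstantTaskExecutor());
        long elapsed = System.nanoTime() - start;
        if (finished != parallelism) {
            throw new IllegalStateException("only " + finished + " of " + parallelism + " tasks finished");
        }
        return elapsed;
    }

    /** Sentinel: flags a run that is slower than the baseline by more than the given tolerance. */
    static boolean isRegression(long baselineNanos, long currentNanos, double tolerance) {
        return currentNanos > baselineNanos * (1.0 + tolerance);
    }

    public static void main(String[] args) {
        long elapsed = measureNanos(1000);
        System.out.println("all tasks FINISHED after " + elapsed + " ns");
    }
}
```

The point of the sketch is the shape of the check rather than the numbers: the sentinel compares against a recorded baseline with a tolerance instead of reporting an absolute score. A real benchmark would run under jmh with warmup and repeated iterations rather than a single {{System.nanoTime()}} interval.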

I'd like to explore this and come up with a credible SCHEDULE benchmark. 
[~pnowojski], I noticed you said "Is it time critical thing? I guess that at 
least in most cases/scenarios/setups no." Could you share what you consider 
critical about SCHEDULE? FYI, apart from correctness issues, we have seen 
{{JobMaster}} become unavailable because rpc requests crashed it.

Looking forward to your reply :-)

> Introduce JobMaster schedule micro-benchmark
> --------------------------------------------
>
>                 Key: FLINK-10320
>                 URL: https://issues.apache.org/jira/browse/FLINK-10320
>             Project: Flink
>          Issue Type: Improvement
>          Components: Tests
>            Reporter: 陈梓立
>            Assignee: 陈梓立
>            Priority: Major
>
> Based on the {{org.apache.flink.streaming.runtime.io.benchmark}} code and the 
> repo [flink-benchmark|https://github.com/dataArtisans/flink-benchmarks], I 
> propose to introduce another micro-benchmark which focuses on {{JobMaster}} 
> schedule performance.
> h3. Target
> Benchmark the time from {{JobMaster}} startup (receiving the {{JobGraph}} and 
> initializing) until all tasks are RUNNING. Technically we use a bounded 
> stream and the TM finishes tasks as soon as they arrive, so the interval we 
> actually measure ends when all tasks are FINISHED.
> h3. Case
> 1. A JobGraph that covers EAGER + PIPELINED edges
> 2. A JobGraph that covers LAZY_FROM_SOURCES + PIPELINED edges
> 3. A JobGraph that covers LAZY_FROM_SOURCES + BLOCKING edges
> ps: maybe also benchmark the case where the source gets its input from 
> {{InputSplit}}?
> h3. Implement
> Based on the flink-benchmark repo, we ultimately run the benchmark using jmh, 
> so the whole test suite is split across two repos. The testing environment 
> could be located in the main repo, maybe under 
> flink-runtime/src/test/java/org/apache/flink/runtime/jobmaster/benchmark.
> To measure the performance of {{JobMaster}} scheduling, we need to simulate 
> an environment that:
> 1. has a real {{JobMaster}}
> 2. has a mock/testing {{ResourceManager}} that has infinite resources and 
> reacts immediately.
> 3. has one (or many?) mock/testing {{TaskExecutor}} that deploys and finishes 
> tasks immediately.
> [~trohrm...@apache.org] [~GJL] [~pnowojski] could you please review this 
> proposal to help clarify the goal and concrete details? Thanks in advance.
> Any suggestions are welcome.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
