[
https://issues.apache.org/jira/browse/FLINK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394453#comment-17394453
]
Zhilong Hong commented on FLINK-23654:
--------------------------------------
Thanks for your analysis, [~raganico]! I'd like to share a different view on
this. The future executor and the IO executor share the same
{{ScheduledExecutorService}}, which is created in JobManagerSharedServices. In
our task deployment tests, we found that too many threads decrease deployment
performance. With more threads, the JobManager can create
TaskDeploymentDescriptors (TDDs) faster, but the speed of transporting the TDDs
and of processing them on the TaskExecutor does not improve. The TDDs then pile
up on the JobManager for a long time, which worsens GC performance. Thus, I
think it may be better to split the future executor and the IO executor into
two separate executors.
For each executor, we could add a configuration option, such as
{{jobmanager.future-executor.numberOfThreads}} and
{{jobmanager.io-executor.numberOfThreads}}. The default value could stay the
same as before, i.e., the number of CPU cores on the JobMaster. Advanced users
could then increase the number of threads for the IO executor to improve
checkpoint performance.
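To make the proposal concrete, here is a minimal sketch of the split, not an
actual patch. The class name {{SplitExecutors}} and the use of
java.util.Properties in place of Flink's configuration classes are assumptions
for illustration only, and the two option keys are the ones proposed above
rather than existing Flink options. Each pool falls back to the number of CPU
cores when its option is unset.

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

/** Hypothetical holder for two separate JobManager pools; not an existing Flink class. */
final class SplitExecutors implements AutoCloseable {
    // Option keys as proposed in this comment; they are assumptions, not existing Flink options.
    static final String FUTURE_THREADS_KEY = "jobmanager.future-executor.numberOfThreads";
    static final String IO_THREADS_KEY = "jobmanager.io-executor.numberOfThreads";

    final int futureThreads;
    final int ioThreads;
    final ScheduledExecutorService futureExecutor;
    final ExecutorService ioExecutor;

    SplitExecutors(Properties conf) {
        // Default for both pools: the number of CPU cores, as before.
        int cores = Runtime.getRuntime().availableProcessors();
        futureThreads = readThreads(conf, FUTURE_THREADS_KEY, cores);
        ioThreads = readThreads(conf, IO_THREADS_KEY, cores);
        // Two independent pools, so blocking I/O cannot starve future callbacks.
        futureExecutor = Executors.newScheduledThreadPool(futureThreads);
        ioExecutor = Executors.newFixedThreadPool(ioThreads);
    }

    /** Reads a positive thread count, or returns the default if the key is unset. */
    static int readThreads(Properties conf, String key, int defaultValue) {
        String raw = conf.getProperty(key);
        if (raw == null) {
            return defaultValue;
        }
        int n = Integer.parseInt(raw.trim());
        if (n < 1) {
            throw new IllegalArgumentException(key + " must be >= 1, got " + n);
        }
        return n;
    }

    @Override
    public void close() {
        futureExecutor.shutdownNow();
        ioExecutor.shutdownNow();
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(IO_THREADS_KEY, "16"); // only the IO pool is tuned
        try (SplitExecutors pools = new SplitExecutors(conf)) {
            System.out.println("future threads: " + pools.futureThreads);
            System.out.println("io threads: " + pools.ioThreads);
        }
    }
}
```

With only the IO option set, the future executor keeps the core-count default
while the IO executor gets 16 threads, so blocking checkpoint work would no
longer compete with future callbacks for the same threads.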
Furthermore, would you mind sharing the results of your tests on how the
number of threads affects checkpoint performance?
> Allow configurable number of jobmanager-future threads
> ------------------------------------------------------
>
> Key: FLINK-23654
> URL: https://issues.apache.org/jira/browse/FLINK-23654
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / REST
> Reporter: Nicolas Raga
> Priority: Critical
>
> The JobManagerSharedServices futureExecutor is used for asynchronous requests
> in multiple Flink components. When the JobMaster creates the execution graph,
> it passes the *scheduledExecutorService* (which is the
> jobManagerSharedServices.getScheduledExecutorService) as both the
> *futureExecutor* and the *ioExecutor*. In the ExecutionGraph, the
> *ioExecutor* is the executor used to execute blocking I/O
> operations. It is also passed to the *CheckpointCoordinator*, which uses it
> for asynchronous calls such as disposing of pending checkpoints, cleaning up
> failed checkpoints, etc. The *futureExecutor* is even passed on to the
> *Execution* class, which uses it to dispatch callbacks from futures and
> asynchronous RPC calls from within vertices! Lastly, this executor is also
> used to process asynchronous requests from the Flink REST endpoint.
>
> Hence, using the endpoint for monitoring while large checkpoints or blocking
> I/O operations run on the same thread pool degrades the endpoint's
> performance. We have already verified in tests that increasing this thread
> count allows faster responses to incoming requests. We can begin by simply
> exposing a *jobmanager.future-thread.factor* option that provides a
> multiplier on the number of CPUs. Afterwards, we can consider a dedicated
> thread pool for blocking I/O so that it does not degrade the performance of
> the REST endpoint.