[
https://issues.apache.org/jira/browse/FLINK-22677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jin Xing updated FLINK-22677:
-----------------------------
Description: The current scheduler enforces synchronous registration even
though ShuffleMaster#registerPartitionWithProducer returns a
CompletableFuture. With a remote shuffle service, communication between the
ShuffleMaster and the remote cluster tends to be expensive. Synchronous
registration can block the main thread and may cause negative side effects
such as heartbeat timeouts. In addition, expensive synchronous calls to the
remote cluster can bottleneck the throughput of shuffle resource requests,
especially for batch jobs with complicated DAGs.
> Scheduler should invoke ShuffleMaster#registerPartitionWithProducer in a truly
> asynchronous fashion
> --------------------------------------------------------------------------------------------------
>
> Key: FLINK-22677
> URL: https://issues.apache.org/jira/browse/FLINK-22677
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Coordination
> Reporter: Jin Xing
> Priority: Major
>
> The current scheduler enforces synchronous registration even though
> ShuffleMaster#registerPartitionWithProducer returns a CompletableFuture. With
> a remote shuffle service, communication between the ShuffleMaster and the
> remote cluster tends to be expensive. Synchronous registration can block the
> main thread and may cause negative side effects such as heartbeat timeouts.
> In addition, expensive synchronous calls to the remote cluster can bottleneck
> the throughput of shuffle resource requests, especially for batch jobs with
> complicated DAGs.
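
To illustrate the difference described above, here is a minimal sketch and
not Flink's actual scheduler code. SimpleShuffleMaster is a hypothetical,
simplified stand-in for ShuffleMaster (the real method takes descriptor
objects rather than a String), and slowRemoteMaster, registerBlocking and
registerAsync are names invented for this example. The blocking variant waits
on the future, while the asynchronous variant returns immediately and handles
the result on the main-thread executor once the remote side replies.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncRegistrationSketch {

    /** Hypothetical, simplified stand-in for ShuffleMaster; not Flink's real signature. */
    interface SimpleShuffleMaster {
        CompletableFuture<String> registerPartitionWithProducer(String partitionId);
    }

    /** Simulates a remote shuffle service whose registration call is slow (assumption). */
    static SimpleShuffleMaster slowRemoteMaster(long delayMillis) {
        return partitionId -> CompletableFuture.supplyAsync(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            return "descriptor-for-" + partitionId;
        });
    }

    /** Synchronous style: the calling thread waits until the remote side replies. */
    static String registerBlocking(SimpleShuffleMaster master, String partitionId) {
        return master.registerPartitionWithProducer(partitionId).join();
    }

    /** Asynchronous style: returns at once; the continuation runs on mainThread later. */
    static CompletableFuture<String> registerAsync(
            SimpleShuffleMaster master, String partitionId, Executor mainThread) {
        return master.registerPartitionWithProducer(partitionId)
                .thenApplyAsync(descriptor -> "deployed:" + descriptor, mainThread);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            SimpleShuffleMaster master = slowRemoteMaster(200);
            CompletableFuture<String> pending = registerAsync(master, "p1", mainThread);
            // The main thread is not blocked by the pending registration,
            // so it can still process other work such as heartbeats.
            mainThread.submit(() -> System.out.println("heartbeat handled")).get();
            System.out.println(pending.get(5, TimeUnit.SECONDS));
        } finally {
            mainThread.shutdown();
        }
    }
}
```

With registerBlocking, the same heartbeat task would have to wait behind the
remote call if both ran on the single main thread, which is the risk the
description points out.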
--
This message was sent by Atlassian Jira
(v8.3.4#803005)