[
https://issues.apache.org/jira/browse/FLINK-29339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610447#comment-17610447
]
Chesnay Schepler commented on FLINK-29339:
------------------------------------------
> the actual numbers of the rpc request is limited.
The number of requests is irrelevant though, as is the size of the response.
A single request timing out (e.g., due to the RM actor crashing) is enough:
while the main thread is blocked on that request it cannot process heartbeats,
so the resulting heartbeat timeouts can potentially crash the entire cluster.
You make a good point that the fix is quite involved though, and the method is
indeed only called if cluster partitions are actually used. So I'm fine with
not treating it as a blocker.
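To illustrate the failure mode, here is a minimal sketch that does not use
Flink at all. It assumes the main thread behaves like a single-threaded
executor; the class name, the method name and the "heartbeat" task are
hypothetical stand-ins. One task blocks on a future that is never answered,
and the heartbeat queued behind it does not get to run:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; a single-threaded executor stands in for the main thread.
public class BlockedMainThreadDemo {

    /**
     * Returns true if the heartbeat task ran within timeoutMillis while an
     * earlier task on the same thread was blocked waiting for an RPC response.
     */
    public static boolean heartbeatHandledWhileBlocked(long timeoutMillis) {
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        // Stand-in for an RPC response that never arrives (e.g. the remote side crashed).
        CompletableFuture<String> rpcResponse = new CompletableFuture<>();
        CountDownLatch heartbeatHandled = new CountDownLatch(1);
        try {
            // Task 1: blocks the only thread, like calling get() on the RPC future.
            mainThread.execute(() -> {
                try {
                    rpcResponse.get();
                } catch (Exception ignored) {
                    // The outcome of the request is irrelevant for this demo.
                }
            });
            // Task 2: the "heartbeat", queued behind the blocked task.
            mainThread.execute(heartbeatHandled::countDown);
            return heartbeatHandled.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            // Release the blocked task so the executor can drain and shut down.
            rpcResponse.complete("too late");
            mainThread.shutdown();
        }
    }
}
```

Calling `BlockedMainThreadDemo.heartbeatHandledWhileBlocked(200)` reports
whether the heartbeat was handled within 200 ms. In this sketch it is not,
regardless of how many requests were sent or how small the response would
have been.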
> JobMasterPartitionTrackerImpl#requestShuffleDescriptorsFromResourceManager blocks main thread
> ---------------------------------------------------------------------------------------------
>
> Key: FLINK-29339
> URL: https://issues.apache.org/jira/browse/FLINK-29339
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.16.0
> Reporter: Chesnay Schepler
> Assignee: Xuannan Su
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.16.0
>
>
> {code:java}
> private List<ShuffleDescriptor> requestShuffleDescriptorsFromResourceManager(
>         IntermediateDataSetID intermediateDataSetID) {
>     Preconditions.checkNotNull(
>             resourceManagerGateway, "JobMaster is not connected to ResourceManager");
>     try {
>         return this.resourceManagerGateway
>                 .getClusterPartitionsShuffleDescriptors(intermediateDataSetID)
>                 .get(); // <-- there's your problem
>     } catch (Throwable e) {
>         throw new RuntimeException(
>                 String.format(
>                         "Failed to get shuffle descriptors of intermediate dataset %s from ResourceManager",
>                         intermediateDataSetID),
>                 e);
>     }
> }
> {code}
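
One possible direction for avoiding the blocking get() above is to return the
future and attach the error context as a continuation. The following is only
a sketch under stated assumptions, not the actual fix: the Gateway interface
is a hypothetical stand-in for the ResourceManager gateway, and String is
used in place of ShuffleDescriptor and IntermediateDataSetID so that the
example is self-contained.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Hypothetical sketch; none of these types are Flink's real classes.
public class NonBlockingDescriptorLookup {

    /** Stand-in for the ResourceManager gateway (assumption, simplified types). */
    public interface Gateway {
        CompletableFuture<List<String>> getClusterPartitionsShuffleDescriptors(String dataSetId);
    }

    /**
     * Returns the future instead of calling get() on it, so the calling
     * thread is never blocked. Failures keep the original error message.
     */
    public static CompletableFuture<List<String>> requestShuffleDescriptors(
            Gateway gateway, String dataSetId) {
        return gateway.getClusterPartitionsShuffleDescriptors(dataSetId)
                .handle((descriptors, failure) -> {
                    if (failure != null) {
                        // Wrap the cause with the same context the blocking version added.
                        throw new CompletionException(
                                String.format(
                                        "Failed to get shuffle descriptors of intermediate dataset %s from ResourceManager",
                                        dataSetId),
                                failure);
                    }
                    return descriptors;
                });
    }
}
```

A caller would chain further work onto the returned future (for example with
thenAccept) rather than waiting for it. In a real endpoint that continuation
would also have to be scheduled back onto the main thread, which is part of
why a proper fix is more involved than this sketch.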