[ https://issues.apache.org/jira/browse/FLINK-29339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610779#comment-17610779 ]
Yun Gao commented on FLINK-29339:
---------------------------------
> The numbers are irrelevant though, as is the size of the response.
> You just need a single request to time out (e.g., due to the RM actor crashing)
> to potentially crash the entire cluster because of heartbeat timeouts.

Thanks a lot for the clarification. We'll try to avoid synchronous (blocking)
calls in future development.
Then let's postpone the fix until the beginning of the next version.
> JobMasterPartitionTrackerImpl#requestShuffleDescriptorsFromResourceManager
> blocks main thread
> ---------------------------------------------------------------------------------------------
>
> Key: FLINK-29339
> URL: https://issues.apache.org/jira/browse/FLINK-29339
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.16.0
> Reporter: Chesnay Schepler
> Assignee: Xuannan Su
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.16.0
>
>
> {code:java}
> private List<ShuffleDescriptor> requestShuffleDescriptorsFromResourceManager(
>         IntermediateDataSetID intermediateDataSetID) {
>     Preconditions.checkNotNull(
>             resourceManagerGateway, "JobMaster is not connected to ResourceManager");
>     try {
>         return this.resourceManagerGateway
>                 .getClusterPartitionsShuffleDescriptors(intermediateDataSetID)
>                 .get(); // <-- there's your problem
>     } catch (Throwable e) {
>         throw new RuntimeException(
>                 String.format(
>                         "Failed to get shuffle descriptors of intermediate dataset %s from ResourceManager",
>                         intermediateDataSetID),
>                 e);
>     }
> }
> {code}
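
For illustration only, here is a minimal sketch of how the blocking .get() could
be avoided by returning the future and composing on it. It is not the fix that
was eventually applied to Flink. The Gateway interface is a hypothetical
stand-in for the ResourceManager gateway, and plain String values replace
IntermediateDataSetID and ShuffleDescriptor so that the example is
self-contained.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class NonBlockingShuffleDescriptorRequest {

    /** Hypothetical stand-in for the ResourceManager gateway (not Flink's real interface). */
    interface Gateway {
        CompletableFuture<List<String>> getClusterPartitionsShuffleDescriptors(String dataSetId);
    }

    /**
     * Returns a future instead of blocking on get(). A slow or lost reply then
     * leaves the returned future incomplete rather than stalling the caller's thread.
     */
    static CompletableFuture<List<String>> requestShuffleDescriptors(
            Gateway gateway, String dataSetId) {
        if (gateway == null) {
            // Report the missing connection through the future as well.
            CompletableFuture<List<String>> failed = new CompletableFuture<>();
            failed.completeExceptionally(
                    new IllegalStateException("JobMaster is not connected to ResourceManager"));
            return failed;
        }
        return gateway.getClusterPartitionsShuffleDescriptors(dataSetId)
                .handle((descriptors, error) -> {
                    if (error != null) {
                        // Same error message as the original code, delivered asynchronously.
                        throw new CompletionException(
                                String.format(
                                        "Failed to get shuffle descriptors of intermediate dataset %s from ResourceManager",
                                        dataSetId),
                                error);
                    }
                    return descriptors;
                });
    }

    public static void main(String[] args) {
        // Simulate a reply that has not arrived yet.
        CompletableFuture<List<String>> pending = new CompletableFuture<>();
        CompletableFuture<List<String>> result =
                requestShuffleDescriptors(id -> pending, "dataset-1");

        // The caller is not blocked while the reply is outstanding.
        System.out.println("done before reply: " + result.isDone()); // prints "done before reply: false"

        pending.complete(Arrays.asList("descriptor-a", "descriptor-b"));
        // join() is used only here, in the demo's own thread, to print the outcome.
        System.out.println("descriptors: " + result.join());
    }
}
```

With this shape, callers chain further work on the returned future (for example
with thenAccept) instead of waiting for the value, so a request that times out
fails only that future.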
--
This message was sent by Atlassian Jira
(v8.20.10#820010)