[jira] [Commented] (FLINK-29339) JobMasterPartitionTrackerImpl#requestShuffleDescriptorsFromResourceManager blocks main thread

Yun Gao (Jira) Wed, 28 Sep 2022 20:10:57 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-29339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610779#comment-17610779
 ]


Yun Gao commented on FLINK-29339:
---------------------------------

> The numbers are irrelevant though, as is the size of the response.
> You just need a single request to time out (e.g., due to the RM actor crashing)
> to potentially crash the entire cluster because of heartbeat timeouts.

Thanks a lot for the clarification. We'll try to avoid synchronous (blocking)
calls in future development.

Then let's postpone the fix till the beginning of the next version.

> JobMasterPartitionTrackerImpl#requestShuffleDescriptorsFromResourceManager 
> blocks main thread
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-29339
>                 URL: https://issues.apache.org/jira/browse/FLINK-29339
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.0
>            Reporter: Chesnay Schepler
>            Assignee: Xuannan Su
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.16.0
>
>
> {code:java}
> private List<ShuffleDescriptor> requestShuffleDescriptorsFromResourceManager(
>         IntermediateDataSetID intermediateDataSetID) {
>     Preconditions.checkNotNull(
>             resourceManagerGateway, "JobMaster is not connected to ResourceManager");
>     try {
>         return this.resourceManagerGateway
>                 .getClusterPartitionsShuffleDescriptors(intermediateDataSetID)
>                 .get(); // <-- there's your problem
>     } catch (Throwable e) {
>         throw new RuntimeException(
>                 String.format(
>                         "Failed to get shuffle descriptors of intermediate dataset %s from ResourceManager",
>                         intermediateDataSetID),
>                 e);
>     }
> }
> {code}
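
For illustration only, here is a minimal sketch of how the blocking `.get()` in the quoted method could be avoided by returning the future to the caller and bounding it with a timeout. It is not the fix that was merged for this issue. `DescriptorGateway`, `requestDescriptors`, the `String` descriptor type, and the timeout values are hypothetical stand-ins, not Flink's actual interfaces; `orTimeout` assumes Java 9 or later.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class NonBlockingLookupSketch {

    /** Hypothetical stand-in for the ResourceManager gateway; not Flink's real interface. */
    interface DescriptorGateway {
        CompletableFuture<List<String>> getDescriptors(String dataSetId);
    }

    /**
     * Returns the future instead of calling get() on it, so the calling thread
     * is never parked while waiting for the response. A timeout bounds how long
     * an unanswered request can stay pending.
     */
    static CompletableFuture<List<String>> requestDescriptors(
            DescriptorGateway gateway, String dataSetId, long timeoutMillis) {
        return gateway.getDescriptors(dataSetId)
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                .handle((descriptors, failure) -> {
                    if (failure != null) {
                        // Same error message as the quoted method, but delivered
                        // through the future rather than thrown on the caller.
                        throw new RuntimeException(
                                String.format(
                                        "Failed to get shuffle descriptors of intermediate dataset %s from ResourceManager",
                                        dataSetId),
                                failure);
                    }
                    return descriptors;
                });
    }

    public static void main(String[] args) {
        // A gateway that answers immediately.
        DescriptorGateway healthy =
                id -> CompletableFuture.completedFuture(List.of("descriptor-of-" + id));
        // A gateway that never answers, e.g. because the remote side crashed.
        DescriptorGateway unresponsive = id -> new CompletableFuture<>();

        requestDescriptors(healthy, "ds-1", 100)
                .thenAccept(d -> System.out.println("received: " + d));

        // join() is used here only so the demo JVM waits for the timeout;
        // code running on a main thread would chain a callback instead.
        requestDescriptors(unresponsive, "ds-2", 100)
                .exceptionally(t -> {
                    System.out.println("request failed: " + t.getCause().getMessage());
                    return List.of();
                })
                .join();
    }
}
```

With this shape, a request that never completes fails the returned future after the timeout instead of stalling the thread that issued it. In an actor-style component the follow-up callback would additionally have to be scheduled back onto the component's own executor, which this sketch leaves out.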



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
