[jira] [Commented] (FLINK-29339) JobMasterPartitionTrackerImpl#requestShuffleDescriptorsFromResourceManager blocks main thread

Chesnay Schepler (Jira) Wed, 28 Sep 2022 02:02:07 -0700


    [ https://issues.apache.org/jira/browse/FLINK-29339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610447#comment-17610447 ]


Chesnay Schepler commented on FLINK-29339:
------------------------------------------

> the actual numbers of the rpc request is limited.

The number of requests is irrelevant though, as is the size of the response.
A single request timing out (e.g., due to the RM actor crashing) is enough to
block the main thread and potentially crash the entire cluster because of
heartbeat timeouts.

You make a good point that the fix is quite involved though, and the method is
indeed only called if cluster partitions are actually used. So I'm fine with
not treating it as a blocker.
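
For illustration only, here is a minimal sketch of the non-blocking shape such a fix could take. It is not the actual Flink change. `DescriptorGateway`, `requestDescriptors`, and the single-threaded executor standing in for the JobMaster main thread are hypothetical names introduced for this example, and `orTimeout` assumes Java 9 or later. The idea is to return the future instead of calling `get()`, bound it with a timeout, and handle the result on the main thread once it arrives, so that heartbeats keep being processed in the meantime.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class NonBlockingLookupSketch {

    /** Hypothetical stand-in for the ResourceManager gateway; not Flink's real interface. */
    interface DescriptorGateway {
        CompletableFuture<List<String>> getDescriptors(String dataSetId);
    }

    /**
     * Returns the pending future instead of blocking on get(). If no response
     * arrives within the timeout, the future fails with a TimeoutException.
     */
    static CompletableFuture<List<String>> requestDescriptors(
            DescriptorGateway gateway, String dataSetId, long timeoutMillis) {
        return gateway.getDescriptors(dataSetId)
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /**
     * Simulates a main thread that issues a request to a gateway that never
     * answers, then counts how many "heartbeat" tasks still get processed.
     */
    static int heartbeatsProcessedWhileWaiting() {
        // Assumption: a single-threaded executor models the JobMaster main thread.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        AtomicInteger heartbeats = new AtomicInteger();
        try {
            DescriptorGateway silentGateway = id -> new CompletableFuture<>();

            // Issue the request from the main thread; the task returns immediately.
            mainThread.submit(() -> {
                requestDescriptors(silentGateway, "ds-1", 60_000)
                        .whenCompleteAsync(
                                (descriptors, failure) -> {
                                    // Handle the result or the failure here,
                                    // back on the main thread.
                                },
                                mainThread);
            }).get(1, TimeUnit.SECONDS);

            // Heartbeats queued behind the request are not held up by it.
            for (int i = 0; i < 3; i++) {
                mainThread.submit(() -> {
                    heartbeats.incrementAndGet();
                }).get(1, TimeUnit.SECONDS);
            }
            return heartbeats.get();
        } catch (Exception e) {
            throw new IllegalStateException("main thread was blocked or failed", e);
        } finally {
            mainThread.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("heartbeats processed: " + heartbeatsProcessedWhileWaiting());
    }
}
```

With a blocking `get()` inside the first task, the heartbeat tasks would sit in the queue until the request completed or failed. In this sketch every caller has to work with a future rather than a plain list, which is part of why a real fix is more involved than swapping one call.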

> JobMasterPartitionTrackerImpl#requestShuffleDescriptorsFromResourceManager 
> blocks main thread
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-29339
>                 URL: https://issues.apache.org/jira/browse/FLINK-29339
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.0
>            Reporter: Chesnay Schepler
>            Assignee: Xuannan Su
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.16.0
>
>
> {code:java}
> private List<ShuffleDescriptor> requestShuffleDescriptorsFromResourceManager(
>         IntermediateDataSetID intermediateDataSetID) {
>     Preconditions.checkNotNull(
>             resourceManagerGateway, "JobMaster is not connected to ResourceManager");
>     try {
>         return this.resourceManagerGateway
>                 .getClusterPartitionsShuffleDescriptors(intermediateDataSetID)
>                 .get(); // <-- there's your problem
>     } catch (Throwable e) {
>         throw new RuntimeException(
>                 String.format(
>                         "Failed to get shuffle descriptors of intermediate dataset %s from ResourceManager",
>                         intermediateDataSetID),
>                 e);
>     }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
