[
https://issues.apache.org/jira/browse/FLINK-31144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691916#comment-17691916
]
Zhu Zhu commented on FLINK-31144:
---------------------------------
Thanks for reporting this issue! [~jto]
What's the the parallelism of the job vertices? Unfortunately I cannot tell it
from the attached image.
IIUC, the problem is caused by vertices with large parallelism and ALL-to-ALL
edges. If so, changing the `if` condition of [this
logic|https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L98-L102)]
to {{consumedPartitionGroup.size() > MAX_DISTINCT_CONSUMERS_TO_CONSIDER}} may
help. Would you give a try?
> Slow scheduling on large-scale batch jobs
> ------------------------------------------
>
> Key: FLINK-31144
> URL: https://issues.apache.org/jira/browse/FLINK-31144
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Reporter: Julien Tournay
> Priority: Major
> Attachments: flink-1.17-snapshot-1676473798013.nps,
> image-2023-02-21-10-29-49-388.png
>
>
> When executing a complex job graph at high parallelism
> `DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` can
> get slow and cause long pauses where the JobManager becomes unresponsive and
> all the taskmanagers just wait. I've attached a VisualVM snapshot to
> illustrate the problem.[^flink-1.17-snapshot-1676473798013.nps]
> At Spotify we have complex jobs where this issue can cause batch "pause" of
> 40+ minutes and make the overall execution 30% slower or more.
> More importantly this prevent us from running said jobs on larger cluster as
> adding resources to the cluster worsen the issue.
> We have successfully tested a modified Flink version where
> `DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` was
> completely commented and simply returns an empty collection and confirmed it
> solves the issue.
> In the same spirit as a recent change
> ([https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L98-L102)]
> there could be a mechanism in place to detect when Flink run into this
> specific issue and just skip the call to `getInputLocationFutures`
> [https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L105-L108.]
> I'm not familiar enough with the internals of Flink to propose a more
> advanced fix, however it seems like a configurable threshold on the number of
> consumer vertices above which the preferred location is not computed would
> do. If this solution is good enough, I'd be happy to submit a PR.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)