[jira] [Commented] (FLINK-15031) Automatically calculate required network memory for fine-grained jobs

Yangze Guo (Jira) Tue, 29 Jun 2021 02:00:05 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371227#comment-17371227
 ]


Yangze Guo commented on FLINK-15031:
------------------------------------

{quote}Is the problem here that the {{TaskManager}} is started pre-configured 
and we somehow need to recompute this value on the {{ResourceManager}}? If the 
{{TaskManager}} would just offer a pool of network memory and we could cut off 
a part as we request a new slot, then we could simply say that this slot should 
be started with this network memory configuration.
{quote}
I may not fully understand your question. If we want to apply the automatically 
calculated network memory to UNKNOWN resource requirements, we would need to 
introduce a new type of ResourceProfile that has a specific network memory 
while all other fields are UNKNOWN. Since that may add considerable system 
complexity, we prefer to limit the scope of this ticket to fine-grained 
resource requirements as a first step. For an UNKNOWN requirement, the network 
memory it gets at runtime will depend on the configuration of the TaskManager 
it is placed on.

> Automatically calculate required network memory for fine-grained jobs
> ---------------------------------------------------------------------
>
>                 Key: FLINK-15031
>                 URL: https://issues.apache.org/jira/browse/FLINK-15031
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Assignee: Jin Xing
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When resources are specified, we expect each operator to declare its 
> required resources before using them. That way, no resource-related error 
> should occur as long as usage stays within what was declared. This ensures 
> that a deployed task does not fail due to insufficient resources in the TM. 
> Such failures are unnecessary and may even cause a job to hang forever, 
> failing repeatedly while deploying tasks to a TM with insufficient 
> resources.
> Shuffle memory is the last missing piece for this goal at the moment. Tasks 
> need a minimum number of network buffers to work. Currently, a task can be 
> deployed to a TM with insufficient network buffers and then fail on 
> launch.
> To avoid that, we should calculate the required network memory for a 
> task/SlotSharingGroup before allocating a slot for it.
> The required shuffle memory can be derived from the number of required 
> network buffers. The number of buffers required by a task (ExecutionVertex) is:
> {code:java}
> numInputChannel * buffersPerChannel   // exclusive buffers for input channels
>   + numberOfSubpartitions + 1         // required buffers for result partition buffer pool (currently)
> {code}
> Note that this applies to the {{NettyShuffleService}} case. For custom 
> shuffle services, there is currently no way to get the required shuffle 
> memory of a task.
> To keep things simple under dynamic slot sharing, the required shuffle 
> memory for a task should be the maximum required shuffle memory over all 
> {{ExecutionVertex}} instances of the same {{ExecutionJobVertex}}. The 
> required shuffle memory for a slot sharing group should then be the sum of 
> the shuffle memory of each {{ExecutionJobVertex}} within it.
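
For illustration, the calculation described above could be sketched roughly as 
follows. This is only a minimal sketch and not the actual Flink implementation: 
the class {{ShuffleMemorySketch}} and all of its method names are hypothetical, 
each task is assumed to have a single result partition, and the buffer size is 
passed in as a plain parameter (the example uses 32 KiB, which I assume to be 
the configured network buffer size).

```java
public class ShuffleMemorySketch {

    /** Buffers required by one ExecutionVertex, following the formula in the description. */
    public static int requiredBuffers(
            int numInputChannels, int buffersPerChannel, int numberOfSubpartitions) {
        // Exclusive buffers for all input channels.
        int exclusiveInputBuffers = numInputChannels * buffersPerChannel;
        // Buffers required by the result partition buffer pool.
        int resultPartitionBuffers = numberOfSubpartitions + 1;
        return exclusiveInputBuffers + resultPartitionBuffers;
    }

    /** Converts a buffer count into bytes for a given buffer size (an assumed parameter). */
    public static long requiredBytes(int buffers, long bufferSizeBytes) {
        return (long) buffers * bufferSizeBytes;
    }

    /** Requirement of an ExecutionJobVertex: the max over its ExecutionVertex instances. */
    public static long maxOverExecutionVertices(long[] perExecutionVertexBytes) {
        long max = 0L;
        for (long bytes : perExecutionVertexBytes) {
            max = Math.max(max, bytes);
        }
        return max;
    }

    /** Requirement of a slot sharing group: the sum over its ExecutionJobVertex entries. */
    public static long sumOverJobVertices(long[] perJobVertexBytes) {
        long sum = 0L;
        for (long bytes : perJobVertexBytes) {
            sum += bytes;
        }
        return sum;
    }

    public static void main(String[] args) {
        long bufferSize = 32L * 1024L; // assumed buffer size of 32 KiB

        // Job vertex A with two subtasks that see different numbers of input channels.
        long a1 = requiredBytes(requiredBuffers(4, 2, 3), bufferSize);
        long a2 = requiredBytes(requiredBuffers(2, 2, 3), bufferSize);
        long jobVertexA = maxOverExecutionVertices(new long[] {a1, a2});

        // Job vertex B: a source without input channels and with 4 subpartitions.
        long jobVertexB =
                maxOverExecutionVertices(
                        new long[] {requiredBytes(requiredBuffers(0, 2, 4), bufferSize)});

        long group = sumOverJobVertices(new long[] {jobVertexA, jobVertexB});
        System.out.println("Slot sharing group shuffle memory (bytes): " + group);
    }
}
```

Taking the max per {{ExecutionJobVertex}} means the value does not depend on 
which subtask ends up in a given slot, at the cost of possibly over-reserving 
for subtasks with fewer channels.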



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
