[ 
https://issues.apache.org/jira/browse/FLINK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhu Zhu updated FLINK-15031:
----------------------------
    Description: 
In resources specified cases, we expect each operator to declare its required 
resources before using them. In this way, no resource-related error should 
happen as long as no resource is used beyond what was declared. This ensures a 
deployed task will not fail due to insufficient resources in the TM. Otherwise, 
tasks may fail unnecessarily, and a job may even hang forever, failing 
repeatedly when deploying tasks to a TM with insufficient resources.

Shuffle memory is the last missing piece for this goal at the moment. Tasks 
require a minimum number of network buffers to work. Currently a task can be 
deployed to a TM with insufficient network buffers, and it then fails on launch.

To avoid that, we should calculate required network memory for a 
task/SlotSharingGroup and set the result in {{ResourceProfile}} before 
allocating a slot for it.
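
A minimal sketch of such a calculation follows. It is illustrative only and 
not Flink's actual implementation: the class and method names are 
hypothetical, and the accounting rule (exclusive buffers per input channel 
plus floating buffers per input gate, and one buffer per subpartition plus one 
per result partition) is an assumption. The constants are modeled on the 
defaults of taskmanager.network.memory.buffers-per-channel, 
taskmanager.network.memory.floating-buffers-per-gate and 
taskmanager.memory.segment-size.

```java
/** Hypothetical sketch; none of these names are part of Flink's API. */
final class ShuffleMemorySketch {

    // Assumed values, modeled on Flink's default network configuration.
    static final int BUFFERS_PER_CHANNEL = 2;        // exclusive buffers per input channel
    static final int FLOATING_BUFFERS_PER_GATE = 8;  // shared buffers per input gate
    static final int SEGMENT_SIZE_BYTES = 32 * 1024; // size of one network buffer

    /**
     * Minimum number of network buffers for one task (assumed rule).
     *
     * @param channelsPerInputGate number of input channels of each input gate
     * @param subpartitionsPerPartition number of subpartitions of each result partition
     */
    static long minBuffersForTask(int[] channelsPerInputGate, int[] subpartitionsPerPartition) {
        long buffers = 0;
        for (int channels : channelsPerInputGate) {
            buffers += (long) channels * BUFFERS_PER_CHANNEL + FLOATING_BUFFERS_PER_GATE;
        }
        for (int subpartitions : subpartitionsPerPartition) {
            buffers += subpartitions + 1L;
        }
        return buffers;
    }

    /** Network memory in bytes for a slot sharing group: the sum over all of its tasks. */
    static long requiredNetworkMemoryBytes(long... buffersPerTask) {
        long totalBuffers = 0;
        for (long buffers : buffersPerTask) {
            totalBuffers += buffers;
        }
        return totalBuffers * SEGMENT_SIZE_BYTES;
    }

    public static void main(String[] args) {
        // A source with one result partition of 4 subpartitions.
        long source = minBuffersForTask(new int[] {}, new int[] {4});
        // A map task with one input gate of 4 channels and one partition of 4 subpartitions.
        long map = minBuffersForTask(new int[] {4}, new int[] {4});
        System.out.println("group network memory (bytes): "
                + requiredNetworkMemoryBytes(source, map));
    }
}
```

The resulting byte count is the value that would be set as the network memory 
of the {{ResourceProfile}} used for the slot request, so that a TM without 
enough network buffers is never chosen for the task.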



  was:
In resources specified cases, we expect the behavior pattern on resources to be 
declare and use. No resource related error should happen if no resource is used 
more than declared. This ensures a job to not fail when resources are limited. 

Shuffle memory is the last missing piece for this goal at the moment. Minimum 
network buffers are required by tasks to work. *Currently a task can be 
deployed to a TM with insufficient network buffers, and fails on launching.* 
This may result in unnecessary failures and may even cause a job hanging 
forever, failing repeatedly on deploying tasks to a TM with few network buffers.

To avoid that, we should calculate required network memory for a 
task/SlotSharingGroup before allocating a slot for it with the 
{{ResourceProfile}}.




> Calculate required shuffle memory before allocating slots in resources 
> specified cases
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-15031
>                 URL: https://issues.apache.org/jira/browse/FLINK-15031
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Major
>             Fix For: 1.10.0
>
>
> In resources specified cases, we expect each operator to declare its required 
> resources before using them. In this way, no resource-related error should 
> happen as long as no resource is used beyond what was declared. This ensures 
> a deployed task will not fail due to insufficient resources in the TM. 
> Otherwise, tasks may fail unnecessarily, and a job may even hang forever, 
> failing repeatedly when deploying tasks to a TM with insufficient resources.
> Shuffle memory is the last missing piece for this goal at the moment. Tasks 
> require a minimum number of network buffers to work. Currently a task can be 
> deployed to a TM with insufficient network buffers, and it then fails on launch.
> To avoid that, we should calculate required network memory for a 
> task/SlotSharingGroup and set the result in {{ResourceProfile}} before 
> allocating a slot for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
