[ 
https://issues.apache.org/jira/browse/FLINK-13163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-13163:
----------------------------------
    Description: 
Flink should be able to execute batch jobs with fewer slots than requested in a 
sequential manner.

At the moment, however, we register a timeout for every slot request, which 
fires after {{slot.request.timeout}} and fails the slot request. Moreover, we 
fail the slot request if the {{ResourceManager}} fails to allocate new 
resources or if the slot request times out on the {{ResourceManager}}. This 
behavior makes sense if we know that we need all requested slots, because we 
then fail early if the requests cannot be fulfilled.

However, for batch jobs it is not strictly required that all slot requests get 
fulfilled. It is enough to have at least one slot for every requested 
{{ResourceProfile}}, i.e. the set of slots (available + allocated) must contain 
a slot which can fulfill each slot request. If this is the case, then we should 
not fail the slot request but instead wait until a slot gets assigned to the 
request.
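
The condition above could be sketched roughly as follows. This is only an 
illustration under simplifying assumptions: {{BatchSlotCheck}}, {{Profile}}, 
{{canFulfill}} and {{canEventuallyFulfillAll}} are hypothetical names and not 
Flink's actual classes or methods, and a resource profile is reduced to CPU 
cores and memory.

```java
import java.util.List;

// Hypothetical stand-in types for illustration; NOT Flink's real API.
final class BatchSlotCheck {

    /** Simplified resource profile (assumption): CPU cores and memory in MB. */
    static final class Profile {
        final double cpuCores;
        final int memoryMb;

        Profile(double cpuCores, int memoryMb) {
            this.cpuCores = cpuCores;
            this.memoryMb = memoryMb;
        }

        /** True if a slot with this profile is large enough for the required profile. */
        boolean canFulfill(Profile required) {
            return cpuCores >= required.cpuCores && memoryMb >= required.memoryMb;
        }
    }

    /**
     * Returns true if every pending request is matched by at least one known
     * slot, where knownSlots is the union of available and allocated slots.
     * In that case a batch job could keep waiting instead of failing the request.
     */
    static boolean canEventuallyFulfillAll(List<Profile> knownSlots,
                                           List<Profile> pendingRequests) {
        return pendingRequests.stream()
                .allMatch(request -> knownSlots.stream()
                        .anyMatch(slot -> slot.canFulfill(request)));
    }
}
```

With a single known slot of 1 core and 1024 MB, two pending requests that each 
fit into that slot pass the check, so they could run one after the other. A 
request for 2 cores does not pass, so failing it early would still make sense.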

  was:The default value of slot.request.timeout is 5 minutes. It causes the 
Flink job to fail if a downstream vertex cannot get resources within 5 minutes. 
Ideally, a batch job should wait indefinitely. 


> Support execution of batch jobs with fewer slots than requested
> ---------------------------------------------------------------
>
>                 Key: FLINK-13163
>                 URL: https://issues.apache.org/jira/browse/FLINK-13163
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Jeff Zhang
>            Priority: Major
>             Fix For: 1.9.0
>
>
> Flink should be able to execute batch jobs with fewer slots than requested in 
> a sequential manner.
> At the moment, however, we register a timeout for every slot request, which 
> fires after {{slot.request.timeout}} and fails the slot request. Moreover, we 
> fail the slot request if the {{ResourceManager}} fails to allocate new 
> resources or if the slot request times out on the {{ResourceManager}}. This 
> behavior makes sense if we know that we need all requested slots, because we 
> then fail early if the requests cannot be fulfilled.
> However, for batch jobs it is not strictly required that all slot requests 
> get fulfilled. It is enough to have at least one slot for every requested 
> {{ResourceProfile}}, i.e. the set of slots (available + allocated) must 
> contain a slot which can fulfill each slot request. If this is the case, then 
> we should not fail the slot request but instead wait until a slot gets 
> assigned to the request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
