[ 
https://issues.apache.org/jira/browse/FLINK-13163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881704#comment-16881704
 ] 

Zhu Zhu edited comment on FLINK-13163 at 7/10/19 3:51 AM:
----------------------------------------------------------

Hi [~till.rohrmann], currently a source task can always get an input split 
from the input split assigner as long as any input splits remain. If there are 
fewer slots than the source task parallelism, the source tasks launched first 
can process all the input splits, so the source tasks launched after they 
finish get no input splits to process. This is described in FLINK-12138.
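
The effect described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions (splits are handed out one at a time in round-robin order, and a wave of tasks keeps its slots until no splits remain). The function name and parameters are hypothetical and are not Flink APIs.

```python
from collections import deque


def simulate_lazy_split_assignment(num_splits, parallelism, num_slots):
    """Return how many splits each source subtask processes.

    Hypothetical model: subtasks are launched in waves of at most
    `num_slots`, and every running subtask keeps requesting splits
    until the assigner has none left.
    """
    if num_slots < 1:
        raise ValueError("at least one slot is required")

    remaining = deque(range(num_splits))  # splits not yet handed out
    pending = deque(range(parallelism))   # subtasks waiting for a slot
    processed = [0] * parallelism

    while pending:
        # Launch as many waiting subtasks as there are slots.
        wave_size = min(num_slots, len(pending))
        running = [pending.popleft() for _ in range(wave_size)]

        # Running subtasks pull splits until the assigner is empty.
        while remaining:
            for subtask in running:
                if not remaining:
                    break
                remaining.popleft()
                processed[subtask] += 1
        # The wave finishes here and frees its slots for the next wave.

    return processed
```

Under these assumptions, 8 splits with parallelism 4 and only 2 slots are all consumed by the first two subtasks, while the same job with 4 slots spreads the splits evenly across all subtasks.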

This may lead to 2 issues:
 # a source task failure causes more regression than expected
 # the effort to reduce the load per source subtask by increasing parallelism 
does not work; a source task can be overwhelmed by too much data

Do you think we need a fix for this to support batch jobs with fewer slots?


was (Author: zhuzh):
Hi [~till.rohrmann], currently a source task can always get an input split 
from the input split assigner as long as any input splits remain. If there are 
fewer slots than the source task parallelism, the source tasks launched first 
can process all the input splits, so the source tasks launched after they 
finish get no input splits to process. This is described in 
[FLINK-12138|https://issues.apache.org/jira/browse/FLINK-12138].

This may lead to 2 issues:
 # a source task failure causes more regression than expected
 # the effort to reduce source subtask load by increasing parallelism does not work

Do you think we need a fix for this to support batch jobs with fewer slots?

> Support execution of batch jobs with fewer slots than requested
> ---------------------------------------------------------------
>
>                 Key: FLINK-13163
>                 URL: https://issues.apache.org/jira/browse/FLINK-13163
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Jeff Zhang
>            Assignee: Till Rohrmann
>            Priority: Major
>             Fix For: 1.9.0
>
>
> Flink should be able to execute batch jobs with fewer slots than requested in 
> a sequential manner.
> At the moment, however, we register a timeout for every slot request, which 
> fires after {{slot.request.timeout}} and fails the slot request. Moreover, we 
> fail the slot request if the {{ResourceManager}} fails to allocate new 
> resources or if the slot request times out on the {{ResourceManager}}. This 
> kind of behavior makes sense if we know that we need all requested slots, so 
> that we fail early if they cannot be fulfilled.
> However, for batch jobs it is not strictly required that all slot requests 
> get fulfilled. It is enough to have at least one slot for every requested 
> {{ResourceProfile}} (the set of slots (available + allocated) must contain a 
> slot which can fulfill a slot request). If this is the case, then we should 
> not fail the slot request but instead wait until the slot gets assigned to 
> the request. If there is no such slot, then we should still time out the 
> request.
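
The rule proposed in the description above could be sketched roughly as follows. This is a simplified illustration, not Flink's implementation: the {{ResourceProfile}} class here is a two-field stand-in, and `should_fail_on_timeout` and its parameters are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceProfile:
    """Simplified stand-in for a slot's resources (assumed fields)."""
    cpu_cores: float
    memory_mb: int

    def is_matching(self, required):
        # A slot can fulfill a request if it offers at least the
        # requested resources in every dimension.
        return (self.cpu_cores >= required.cpu_cores
                and self.memory_mb >= required.memory_mb)


def should_fail_on_timeout(requested, available_slots, allocated_slots,
                           is_batch_job):
    """Decide whether a timed-out slot request should be failed."""
    if not is_batch_job:
        # Non-batch jobs need all requested slots, so fail early.
        return True
    # For batch jobs, keep waiting as long as any known slot
    # (available or currently allocated) could fulfill the request.
    known_slots = list(available_slots) + list(allocated_slots)
    return not any(slot.is_matching(requested) for slot in known_slots)
```

In this sketch a batch request keeps waiting when a matching slot is merely occupied, and is only failed when no available or allocated slot could ever serve it.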



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
