[
https://issues.apache.org/jira/browse/FLINK-13163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882022#comment-16882022
]
Ken Krugler commented on FLINK-13163:
-------------------------------------
Hi [~zhuzh] - thanks for this report, and the notes. I've found that in my
batch jobs, limiting source parallelism seems to help reduce the number of
failures. Is there a way to determine (via logs) whether my issue(s) are
related?
> Support execution of batch jobs with fewer slots than requested
> ---------------------------------------------------------------
>
> Key: FLINK-13163
> URL: https://issues.apache.org/jira/browse/FLINK-13163
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.9.0
> Reporter: Jeff Zhang
> Assignee: Till Rohrmann
> Priority: Major
> Fix For: 1.9.0
>
>
> Flink should be able to execute batch jobs with fewer slots than requested in
> a sequential manner.
> At the moment, however, we register a timeout for every slot request, which
> fires after {{slot.request.timeout}} and fails the slot request. Moreover, we
> fail the slot request if the {{ResourceManager}} fails to allocate new
> resources or if the slot request times out on the {{ResourceManager}}. This
> behavior makes sense if we know that we need all requested slots, because we
> then fail early if they cannot be fulfilled.
> However, for batch jobs it is not strictly required that all slot requests
> get fulfilled. It is enough to have at least one slot for every requested
> {{ResourceProfile}}; that is, the set of slots (available + allocated) must
> contain a slot which can fulfill the slot request. If this is the case, then we should
> not fail the slot request but instead wait until the slot gets assigned to
> the request. If there is no such slot, then we should still time out the
> request.
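
The rule described in the issue can be illustrated with a minimal sketch. This is not Flink's actual implementation: the class BatchSlotRequestSketch, the simplified Profile type (a stand-in for {{ResourceProfile}} with only CPU and memory), the Decision enum, and the onTimeout method are all hypothetical names made up for illustration. The sketch only shows the decision taken when a batch slot request hits its timeout: keep waiting if any available or allocated slot could fulfill the request, otherwise time out.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical sketch of the proposed rule; none of these types are Flink classes. */
public class BatchSlotRequestSketch {

    /** Simplified stand-in for a resource profile (assumption: CPU and memory only). */
    public static final class Profile {
        final double cpuCores;
        final int memoryMb;

        public Profile(double cpuCores, int memoryMb) {
            this.cpuCores = cpuCores;
            this.memoryMb = memoryMb;
        }

        /** A slot can fulfill a request if it offers at least the requested resources. */
        boolean canFulfill(Profile requested) {
            return cpuCores >= requested.cpuCores && memoryMb >= requested.memoryMb;
        }
    }

    public enum Decision { KEEP_WAITING, TIME_OUT }

    /**
     * Decides what to do when a batch slot request reaches its timeout.
     * The request keeps waiting if the set of slots (available + allocated)
     * contains at least one slot that could fulfill it; otherwise it times out.
     */
    public static Decision onTimeout(
            Profile requested, List<Profile> available, List<Profile> allocated) {
        boolean fulfillable = Stream.concat(available.stream(), allocated.stream())
                .anyMatch(slot -> slot.canFulfill(requested));
        return fulfillable ? Decision.KEEP_WAITING : Decision.TIME_OUT;
    }

    public static void main(String[] args) {
        Profile request = new Profile(1.0, 1024);
        Profile busySlot = new Profile(2.0, 2048);   // currently allocated to another task
        Profile tinySlot = new Profile(0.5, 256);    // free, but too small

        // A matching slot exists (although it is busy), so the request should wait.
        System.out.println(onTimeout(
                request, Collections.singletonList(tinySlot), Arrays.asList(busySlot)));

        // No slot in the cluster could ever serve the request, so it should time out.
        System.out.println(onTimeout(
                request, Collections.singletonList(tinySlot), Collections.<Profile>emptyList()));
    }
}
```

Note that the check deliberately includes allocated slots: a busy slot will eventually be released, which is what allows a batch job to run sequentially with fewer slots than requested.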