[ https://issues.apache.org/jira/browse/FLINK-13163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882022#comment-16882022 ]
Ken Krugler commented on FLINK-13163:
-------------------------------------

Hi [~zhuzh] - thanks for this report, and the notes.

I've found that in my batch jobs, limiting source parallelism seems to help reduce the number of failures. Is there a way to determine (via logs) whether my issue(s) are related?

> Support execution of batch jobs with fewer slots than requested
> ---------------------------------------------------------------
>
>                 Key: FLINK-13163
>                 URL: https://issues.apache.org/jira/browse/FLINK-13163
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Jeff Zhang
>            Assignee: Till Rohrmann
>            Priority: Major
>             Fix For: 1.9.0
>
>
> Flink should be able to execute batch jobs with fewer slots than requested in
> a sequential manner.
> At the moment, however, we register for every slot request a timeout which
> fires after {{slot.request.timeout}} to fail the slot request. Moreover, we
> fail the slot request if the {{ResourceManager}} fails to allocate new
> resources or if the slot request times out on the {{ResourceManager}}. This
> kind of behavior makes sense if we know that we need all requested slots so
> that we fail early if it cannot be fulfilled.
> However, for batch jobs it is not strictly required that all slot requests
> get fulfilled. It is enough to have at least one slot for every requested
> {{ResourceProfile}} (the set of slots (available + allocated) must contain a
> slot which can fulfill a slot request). If this is the case, then we should
> not fail the slot request but instead wait until the slot gets assigned to
> the request. If there is no such slot, then we should still time out the
> request.
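
The rule proposed in the issue description (keep waiting if some known slot could serve the request, time out otherwise) can be sketched roughly as follows. This is only an illustration under stated assumptions: the class BatchSlotTimeoutSketch, the Profile type, and shouldFailOnTimeout are hypothetical names, not Flink's actual API, and the CPU/memory comparison is a simplified stand-in for how a ResourceProfile match might be checked.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the proposed rule; these are not Flink's real classes. */
final class BatchSlotTimeoutSketch {

    /** Simplified stand-in for a resource profile: CPU cores and memory in MB. */
    static final class Profile {
        final double cpuCores;
        final int memoryMb;

        Profile(double cpuCores, int memoryMb) {
            this.cpuCores = cpuCores;
            this.memoryMb = memoryMb;
        }

        /** True if a slot with this profile is large enough for the requested one. */
        boolean canFulfill(Profile requested) {
            return cpuCores >= requested.cpuCores && memoryMb >= requested.memoryMb;
        }
    }

    /**
     * Decides what to do when the timeout of a pending batch slot request fires.
     * knownSlots is assumed to hold both available and allocated slots.
     * Returns true if the request should fail, false if it should keep waiting.
     */
    static boolean shouldFailOnTimeout(Profile requested, List<Profile> knownSlots) {
        for (Profile slot : knownSlots) {
            if (slot.canFulfill(requested)) {
                return false; // a matching slot exists, so the request can wait for it
            }
        }
        return true; // no known slot could ever serve this request
    }

    public static void main(String[] args) {
        List<Profile> slots = Arrays.asList(new Profile(1.0, 1024), new Profile(2.0, 4096));
        System.out.println(shouldFailOnTimeout(new Profile(2.0, 2048), slots)); // false
        System.out.println(shouldFailOnTimeout(new Profile(4.0, 2048), slots)); // true
    }
}
```

In this sketch an allocated slot counts the same as a free one, which mirrors the "available + allocated" wording in the description: a busy slot is expected to be released once its current task finishes, so the batch job can proceed sequentially.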