StephanEwen commented on issue #8740: [FLINK-12763][runtime] Fail job immediately if tasks’ resource needs can not be satisfied.
URL: https://github.com/apache/flink/pull/8740#issuecomment-508656222

I think the difference between batch and streaming should not manifest in the ResourceManager. It can manifest in the scheduler, so let's see if we can cover this there.

What do you think about this approach:

  - When the scheduler requests slots from the SlotPool, it uses a timeout.
  - For streaming, that timeout is finite (you want a "NotEnoughResourcesAvailable" exception rather soon).
  - For batch, it is infinite, because the same slots can be used one after another.
  - Failures from the ResourceManager to allocate a slot (timeout or otherwise) only cancel the Future. They are not propagated to the request from the scheduler.
  - Open issue: how to ensure that there is at least one slot of the relevant size.

Long Term Approach

  - We want to change the SlotPool such that you set something like
    `min: x slots of profile a and y slots of profile b`
    `preferred: k slots of profile a, i slots of profile b`
  - That is also the way to grow resources before triggering scaling in streaming auto scaling.
  - In streaming, when a "NotEnoughResourcesAvailable" exception comes, we trigger auto-scale-down.

Short Term

  - Maybe we assume the minimum is always one.
  - Slot pool requests do not fail as long as there is one slot of the desired resource profile.
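
The timeout rule and the short-term "at least one slot" rule above could be sketched in Java roughly as follows. This is only an illustration under assumptions: `SlotRequestSketch`, `JobType`, `slotRequestTimeout`, `withTimeout`, and `shouldFailRequest` are hypothetical names, not Flink classes or methods, and the 30-second streaming timeout is an arbitrary placeholder. `CompletableFuture.orTimeout` requires Java 9 or later.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names exist in Flink.
final class SlotRequestSketch {

    enum JobType { STREAMING, BATCH }

    /**
     * Streaming gets a finite timeout so the job fails soon when resources
     * are missing. Batch gets no timeout (empty Optional), because the same
     * slots can be reused one after another.
     */
    static Optional<Duration> slotRequestTimeout(JobType type, Duration streamingTimeout) {
        return type == JobType.STREAMING ? Optional.of(streamingTimeout) : Optional.empty();
    }

    /**
     * Applies the timeout to the scheduler-side request future. With an
     * empty timeout the future is returned unchanged and may wait forever.
     */
    static <T> CompletableFuture<T> withTimeout(CompletableFuture<T> request,
                                                Optional<Duration> timeout) {
        return timeout
                .map(t -> request.orTimeout(t.toMillis(), TimeUnit.MILLISECONDS))
                .orElse(request);
    }

    /**
     * Short-term rule, assuming the minimum is always one: a request fails
     * only if no slot of the desired resource profile exists at all.
     */
    static boolean shouldFailRequest(int slotsMatchingProfile) {
        return slotsMatchingProfile < 1;
    }

    public static void main(String[] args) {
        Duration streamingTimeout = Duration.ofSeconds(30); // placeholder value
        System.out.println("streaming: " + slotRequestTimeout(JobType.STREAMING, streamingTimeout));
        System.out.println("batch: " + slotRequestTimeout(JobType.BATCH, streamingTimeout));
        System.out.println("fail with 0 matching slots: " + shouldFailRequest(0));
        System.out.println("fail with 1 matching slot: " + shouldFailRequest(1));
    }
}
```

In this sketch the streaming/batch distinction lives entirely in the choice of timeout on the scheduler side, which matches the suggestion that the ResourceManager should not need to know about it.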
