StephanEwen commented on issue #8740: [FLINK-12763][runtime] Fail job 
immediately if tasks’ resource needs can not be satisfied.
URL: https://github.com/apache/flink/pull/8740#issuecomment-508479800
 
 
   I think that the problem you describe is a more general problem for the 
standalone resource manager.
   
   In standalone mode, it can take a long time until streaming jobs fail with 
the "not enough resources" exception and batch jobs fail with "no matching 
slot". So why don't we solve it in a more general way?
   
   I like the idea of a "startup period" in which the standalone RM waits for a 
longer timeout for TMs (and thus slots) to appear, and after that period slot 
requests are failed immediately if no free slot is readily available. That idea 
has floated around for a bit, maybe it is time to go for it.
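   
   A minimal sketch of that decision rule, to make the proposal concrete. 
The class `StartupPeriodSlotPolicy`, its `Decision` enum, and all parameter 
names are hypothetical and only for illustration; this is not the existing 
StandaloneResourceManager code, and the startup period length is an assumed 
configuration value.

   ```java
   /**
    * Hypothetical illustration of the "startup period" idea.
    * Not part of Flink; all names here are assumptions.
    */
   final class StartupPeriodSlotPolicy {

       /** What to do with a slot request at a given point in time. */
       enum Decision { FULFILL, WAIT, FAIL_IMMEDIATELY }

       // Point in time (in milliseconds) at which the startup period ends.
       private final long startupDeadlineMillis;

       StartupPeriodSlotPolicy(long startMillis, long startupPeriodMillis) {
           this.startupDeadlineMillis = startMillis + startupPeriodMillis;
       }

       /**
        * A free slot always satisfies the request. Without a free slot,
        * the request waits while TMs may still register (startup period)
        * and fails right away once that period is over.
        */
       Decision decide(long nowMillis, int freeSlots) {
           if (freeSlots > 0) {
               return Decision.FULFILL;
           }
           return nowMillis < startupDeadlineMillis
                   ? Decision.WAIT
                   : Decision.FAIL_IMMEDIATELY;
       }
   }
   ```

   With a 30 second startup period, a request at second 5 with no free slot 
would wait, while the same request at second 60 would fail immediately. A 
request that finds a free slot succeeds in both cases, so TMs that register 
late are still used.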
   
   What I don't quite understand is the "mixed solution" in this PR, in which 
the startup period is used to discover what resource profiles are available. 
After that, requests still time out only after a long time unless they request 
a resource profile that is incompatible with the ones seen during the startup 
period.
   
   I think this may lead to strange behavior:
     - TaskManagers that register late might not get used. You can start larger 
TMs later, they register, but slot requests still fail.
     - A profile might be available during the startup period, but the TMs shut 
down later, and the slot requests cannot be fulfilled any more. But the 
requests take a long time to fail, because the resource profile was a known 
profile.
   
   All this becomes both easier and more consistent with a simple 
startup period for the StandaloneResourceManager. After that, all slot 
requests fail immediately unless a slot is directly available.
   
   What do you think?
   
   BTW: This would be a change we need to discuss on dev/user mailing lists, 
because it changes system behavior. Probably most users would agree that it is 
for the better, but nonetheless, we need to be transparent there.

