StephanEwen commented on issue #9058: [FLINK-13166] Add support for batch slot 
requests to SlotPoolImpl
URL: https://github.com/apache/flink/pull/9058#issuecomment-509988368
 
 
   I agree with Till here. The logic is not yet perfect, but should be an 
improvement over the current state.
   
   Under fine-grained recovery, the current state leads to the failure of a 
task and its individual recovery, which re-triggers a request to the RM. That 
is good, but the downside is that each such failure uses up a recovery attempt. 
I think it is tricky for users to understand that we rely on failure/recovery 
to re-request resources. It makes retry attempts meaningless and leads users to 
debug jobs (because they see unexpected failures) when really nothing is wrong.
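   
   To make that cost concrete, here is a toy sketch of a fixed restart budget 
in which a slot-request timeout is surfaced as an ordinary task failure. It is 
not Flink code: the class `RestartBudget` and its methods are hypothetical and 
only illustrate the accounting.

```java
/** Toy model, not Flink code: a fixed budget of restart attempts. */
final class RestartBudget {
    private int remaining;

    RestartBudget(int maxAttempts) {
        this.remaining = maxAttempts;
    }

    /**
     * Records a task failure and returns true if a restart is still allowed.
     * A slot-request timeout surfaced as a failure is counted like any other
     * cause; the cause string is informational only.
     */
    boolean onTaskFailure(String cause) {
        if (remaining == 0) {
            return false; // budget exhausted, the job fails for good
        }
        remaining--;
        return true;
    }

    int remaining() {
        return remaining;
    }
}
```

   With a budget of three attempts, two slot-request timeouts leave a single 
attempt for a genuine fault, even though nothing was wrong with the job.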
   
   With this change, we no longer rely on failure/recovery, but we also do 
not re-trigger timed-out requests within a stage. A stage may hence not use 
its resources optimally. The requests are made again in the next stage.
   
   As Till suggested, for 1.10 we should consider a different model. 
Requests from the SlotPool to the RM should not time out (unless there is an 
actual failure), and resources that appear at the RM should make it to the 
SlotPool. Letting the SlotPool periodically request resources seems like a 
workaround to me.
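   
   A rough sketch of what such a model could look like, under my own 
assumptions: requests stay pending without a timer, a newly available slot is 
pushed to the pool, and pending requests fail only on an actual failure. This 
is not Flink's SlotPool; `PendingRequestPool` and all of its methods are 
hypothetical names used for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model, not Flink's SlotPool: requests stay pending until a slot is offered. */
final class PendingRequestPool {
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> assignments = new ArrayList<>();

    /** Registers a request. There is no timer, so the request cannot time out. */
    void request(String requestId) {
        pending.add(requestId);
    }

    /**
     * Called when the resource manager pushes a newly available slot.
     * Returns true if the slot was assigned to the oldest pending request.
     */
    boolean offerSlot(String slotId) {
        String requestId = pending.poll();
        if (requestId == null) {
            return false; // nothing is waiting for a slot
        }
        assignments.add(requestId + "->" + slotId);
        return true;
    }

    /** Fails all pending requests; reserved for an actual failure. */
    List<String> failAll() {
        List<String> failed = new ArrayList<>(pending);
        pending.clear();
        return failed;
    }

    int pendingCount() {
        return pending.size();
    }

    List<String> assignments() {
        return assignments;
    }
}
```

   The point of the sketch is that the pool never polls: a request is 
fulfilled whenever `offerSlot` is called, and `failAll` is the only path on 
which a pending request ends without a slot.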

