wilfred-s commented on PR #541:
URL: https://github.com/apache/yunikorn-k8shim/pull/541#issuecomment-1445809081

   The comment in `postAppAccepted()` says that under certain circumstances we 
skip reserving and go directly to running. It is the only point where we 
generate a `TryReserve` event, which could have us end up in `onReserving()`.
   In the case we saw we had all placeholder pods and the originating Spark 
driver pod as part of the recovery. 
   
   In `onReserving()` we create the placeholders, so we would try to create 
all of them from scratch. For most, if not all, of them we would just log 
"placeholder exists". Since we could have crashed halfway through creating the 
placeholders, I think it is OK to retry the creation. The placeholder timeout 
should kick in correctly and clean up anyway. 
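   The retry-after-crash behaviour above can be sketched as an idempotent 
creation loop. This is a minimal illustration, not the actual yunikorn-k8shim 
code; `createPlaceholders` and the `existing` set are hypothetical names:

```go
package main

import "fmt"

// createPlaceholders is a hypothetical sketch of idempotent placeholder
// creation during recovery: pods that survived the restart are skipped
// with a "placeholder exists" log line, only the missing ones are created.
func createPlaceholders(wanted []string, existing map[string]bool) (created, skipped []string) {
	for _, name := range wanted {
		if existing[name] {
			// recovery case: the pod is already there, log and move on
			fmt.Printf("placeholder exists: %s\n", name)
			skipped = append(skipped, name)
			continue
		}
		// crash happened before this one was created: safe to retry
		existing[name] = true
		created = append(created, name)
	}
	return created, skipped
}

func main() {
	// Two placeholders survived the crash, one was never created.
	existing := map[string]bool{"ph-0": true, "ph-1": true}
	created, skipped := createPlaceholders([]string{"ph-0", "ph-1", "ph-2"}, existing)
	fmt.Println(len(created), len(skipped)) // 1 2
}
```

   Because the loop only creates what is missing, rerunning it after a crash 
is harmless, which is why retrying the whole creation step is acceptable.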
   
   Back to our issue: the originator pod (the driver) and one placeholder were 
not allocated. The change adds an extra call to make sure the application 
progresses. That only works if all placeholders are `Bound` to a node when the 
transition is made. I think that call is correct, but it does not fix the case 
we saw by itself.
   The allocation of that last placeholder did not trigger the application 
state change as expected. After 15 minutes all placeholders timed out, leaving 
the originating driver just sitting there.
   The recovered tasks, i.e. all the placeholders, were initialised with the 
state `Allocated`. Unless they moved to the `Bound` state as part of the 
recovery, the check in `onReservationStateChange()` would have failed. The 
change you described as "more sound" is the main fix, I think.
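   To make the failure mode concrete, here is a minimal sketch of such a 
guard. The state names and `allPlaceholdersBound` are illustrative 
assumptions, not the real `onReservationStateChange()` implementation:

```go
package main

import "fmt"

// taskState is a hypothetical stand-in for the shim's task states.
type taskState int

const (
	allocated taskState = iota // state given to tasks rebuilt during recovery
	bound                      // state reached via the normal bind flow
)

// allPlaceholdersBound models the guard discussed above: the application
// only progresses when every placeholder has reached Bound.
func allPlaceholdersBound(states []taskState) bool {
	for _, s := range states {
		if s != bound {
			return false
		}
	}
	return true
}

func main() {
	// Recovered placeholders start out Allocated, so the guard fails
	// even though the pods are actually running on nodes.
	recovered := []taskState{allocated, allocated, allocated}
	fmt.Println(allPlaceholdersBound(recovered)) // false
}
```

   With recovered tasks stuck in `Allocated`, the guard never passes, which 
matches the stalled driver we observed.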
   
   Not sure if we can separately test that one change...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
