wilfred-s commented on PR #890:
URL: https://github.com/apache/yunikorn-k8shim/pull/890#issuecomment-2290588309

   Not sure I agree with the direction.
   
   > Another thing can be a more generic allocation retry where a failed 
volume/pod binding does not result in a failed Task. Instead, we cancel the 
allocation from the shim and let the core re-schedule it at a later time.
   
   I think that is the only correct way to handle this. It is a larger 
change, but anything else is just a band-aid.
   
   We already have the option to increase the bind timeout via the config, 
so wrapping the retry that is already in the binder in another retry is not a 
good idea. If binding takes too long, increase the timeout. The check runs 
every second, so raising the configured timeout from 10s to 30s only affects 
these failure cases.
   
   The documentation for the BindPodVolumes call shows:
   ```
   //     i.  BindPodVolumes() is called first in PreBind phase. It makes all the necessary API updates and waits for
   //     PV controller to fully bind and provision the PVCs. If binding fails, the Pod is sent
   //     back through the scheduler.
   ```
   So even the default scheduler just dumps the pod back into the scheduling 
cycle and retries if binding has still failed after the timeout.
   Looking at the code, the reason for the error might be something that 
cannot be solved. For instance, the node selected for the pod might not work 
for the volume.


