johnhoran commented on issue #59626:
URL: https://github.com/apache/airflow/issues/59626#issuecomment-4038324907

   > it's not ideal to retry entire task execution when only the pod read 
operation needs retrying.
   
   I'm fairly sure that the pod is permanently deleted at this point in the
execution, so any retries of read_pod will fail.  If we were dealing with a
Deployment then Kubernetes would recreate the pod, but since this is a
standalone pod, I believe the responsibility for recreating it falls on the
operator.  Preemption can also happen at any point in a pod's lifecycle.  If
you are using an autoscaling cluster exclusively for pods where
requests == limits, it should mostly occur during pod startup, but even then
it's possible that daemonset deployment could be delayed, so preempted pods
could be some way into their execution.
   
   I think this would be best handled on the Kubernetes side.  One option
would be to use node affinity so pods are only scheduled onto a node after the
daemonsets are running and a label has been set on the node.  You could also
add priority classes to the pod, so pods that don't allow retries have higher
priority.  If you wanted to be more aggressive with scheduling, you could
apply the affinity constraint only when the pod is on its last try -- to that
end I do think we should add max_tries from the task instance as an annotation
on the pod.  Beyond the annotation, all of this is very cluster-specific, so
it would need to be handled either with Kubernetes mutating webhooks or via
callbacks.
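   To make the idea concrete, here is a minimal sketch in plain Python dicts
(roughly the manifest shape a mutating webhook or pod-mutation callback would
operate on).  The label key `daemonsets-ready`, the priority class name
`no-retries-left`, and the annotation key are all invented for illustration;
nothing here reflects what Airflow actually sets today.

   ```python
   # Hypothetical mutation combining the three suggestions above:
   # annotate the pod with its retry budget, and on the final try give it
   # a higher priority class plus a node-affinity rule restricting it to
   # nodes whose daemonsets have started and labelled the node.

   def mutate_pod(pod: dict, try_number: int, max_tries: int) -> dict:
       meta = pod.setdefault("metadata", {})
       # Expose the task instance's retry budget on the pod so cluster-side
       # tooling (webhooks, controllers) can make decisions from it.
       meta.setdefault("annotations", {})["airflow.example.org/max-tries"] = str(max_tries)

       spec = pod.setdefault("spec", {})
       if try_number >= max_tries:
           # Last attempt: prefer preempting other workloads over this pod.
           spec["priorityClassName"] = "no-retries-left"
           # Only schedule onto nodes that daemonsets have marked ready.
           spec["affinity"] = {
               "nodeAffinity": {
                   "requiredDuringSchedulingIgnoredDuringExecution": {
                       "nodeSelectorTerms": [
                           {"matchExpressions": [
                               {"key": "daemonsets-ready",
                                "operator": "In",
                                "values": ["true"]}
                           ]}
                       ]
                   }
               }
           }
       return pod


   pod = mutate_pod({"metadata": {}, "spec": {}}, try_number=3, max_tries=3)
   ```

   On earlier tries the function would leave scheduling untouched, so the
aggressive placement only kicks in when a preemption would be unrecoverable.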


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
