hterik commented on code in PR #36882:
URL: https://github.com/apache/airflow/pull/36882#discussion_r1469164669
##########
airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py:
##########
@@ -434,9 +434,9 @@ def sync(self) -> None:
)
self.fail(task[0], e)
except ApiException as e:
- # These codes indicate something is wrong with pod definition; otherwise we assume pod
- # definition is ok, and that retrying may work
- if e.status in (400, 422):
+ # In case of the below error codes, fail the task and honor the task retries.
+ # Otherwise, go for continuous/infinite retries.
+ if e.status in (400, 403, 404, 422):
Review Comment:
How does this compare to just increasing your quota in Kubernetes and letting
the K8s scheduler handle all the queuing? Pods can sit in a queued state without
using any node compute.
Stacking queues and schedulers on top of each other feels like it just pushes
the demand elsewhere.
For example, if the retry is implemented in Airflow, will tasks be queued
fairly into Kubernetes, or will new tasks get precedence?
E.g.
t1: Task 2 attempts to start. Quota is full. Fails -> backs off 5 min.
t2: Task 1 completes -> quota is now available.
t3: Task 3 starts -> succeeds because there is no quota limitation any more.
t4: Task 2 retries; quota is full again because Task 3 squeezed in before it.
Not fair, because Task 2 should have gone before Task 3.
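The t1-t4 timeline above can be sketched as a toy simulation (hypothetical, not Airflow or K8s code; `QUOTA`, `BACKOFF`, the minute granularity, and all task timings are assumptions for illustration). It shows Task 2, with a fixed backoff, losing its place to Task 3:

```python
QUOTA = 1    # concurrent pods allowed by the assumed namespace quota
BACKOFF = 5  # minutes between retry attempts after a failed launch

def simulate(arrivals, runtimes, horizon=60):
    """arrivals: {task: minute of first launch attempt};
    runtimes: {task: minutes the task occupies its quota slot}.
    Returns tasks in the order they actually started."""
    next_try = dict(arrivals)  # task -> minute of next launch attempt
    finish = {}                # started task -> minute its slot frees up
    order = []
    for t in range(horizon):
        in_use = sum(1 for f in finish.values() if f > t)
        for task in sorted(next_try):  # no fairness memory between attempts
            if next_try[task] != t:
                continue
            if in_use < QUOTA:
                finish[task] = t + runtimes[task]
                order.append(task)
                in_use += 1
                del next_try[task]
            else:
                # failed launch: back off and lose the place in line
                next_try[task] = t + BACKOFF
    return order

# Task 2 asks at minute 1 but starts last: ["task1", "task3", "task2"]
print(simulate({"task1": 0, "task2": 1, "task3": 4},
               {"task1": 3, "task2": 2, "task3": 10}))
```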
To be honest, I'm not very familiar with quotas. They seem more useful for
servers that run all the time than for a batch-oriented workload like Airflow
tasks, which start and stop all the time and where relying on the built-in
queue may be more appropriate. Maybe others have valid use cases for them
that I'm not aware of.
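For reference, the branching the diff above implements can be sketched like this (hypothetical simplification, not the actual executor code; a stand-in `ApiException` is defined so the snippet runs without the `kubernetes` package, whose real class is `kubernetes.client.rest.ApiException`):

```python
class ApiException(Exception):
    """Stand-in for kubernetes.client.rest.ApiException."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

# Codes treated as permanent: bad pod spec, forbidden, or missing resource.
FAIL_FAST_CODES = (400, 403, 404, 422)

def handle_launch_error(e, fail, requeue):
    """Dispatch a pod-launch ApiException: permanent errors consume one of
    the task's own retries via fail(); anything else (e.g. a transient 500)
    is requeued for another launch attempt."""
    if e.status in FAIL_FAST_CODES:
        fail(e)
    else:
        requeue(e)
```

With a quota-exceeded failure falling into the `requeue` branch, this is exactly the point where the backoff/fairness question above applies.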
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]