hussein-awala commented on code in PR #36882:
URL: https://github.com/apache/airflow/pull/36882#discussion_r1476561460


##########
airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py:
##########
@@ -434,19 +438,35 @@ def sync(self) -> None:
                     )
                     self.fail(task[0], e)
                 except ApiException as e:
-                    # These codes indicate something is wrong with pod definition; otherwise we assume pod
-                    # definition is ok, and that retrying may work
-                    if e.status in (400, 422):
+                    body = json.loads(e.body)
+                    retries = self.task_publish_retries[key]
+                    # Fail the task in the following scenarios.
+                    # 1. kube api status code in (400, 404, 422)
+                    # 2. kube api status code is  403 and not related to exceeded quota
+                    # 3. task publish retries exhausted
+                    if (
+                        e.status in (400, 404, 422)
+                        or (e.status == 403 and "exceeded quota" not in body["message"])
+                        or not (
+                            self.task_publish_max_retries == -1 or retries < self.task_publish_max_retries

Review Comment:
   We can phrase this positively and start from the retry case (the second block):
   ```python
   if e.status == 403 and "exceeded quota" in body["message"] and (self.task_publish_max_retries == -1 or retries < self.task_publish_max_retries):
   ```
   With this condition, every other status code (if any others occur) fails the 
task instead of triggering a retry.
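
To make the suggested condition easier to check in isolation, here is a minimal sketch of it as a standalone predicate. The function name `should_retry_publish` and its parameters are hypothetical and not part of the PR; the defensive JSON parsing is also an assumption, since the diff calls `json.loads(e.body)` directly.

```python
import json


def should_retry_publish(status, body, retries, max_retries):
    """Return True only for a 403 'exceeded quota' error while retries remain.

    Hypothetical helper: names and signature are illustrative, not from the PR.
    """
    try:
        parsed = json.loads(body)
        # Assumption: the Kubernetes API error body is a JSON object with a "message" key.
        message = str(parsed.get("message") or "")
    except (TypeError, ValueError, AttributeError):
        # Missing, non-JSON, or non-object body: treat as "not a quota error".
        message = ""

    quota_exceeded = status == 403 and "exceeded quota" in message
    # -1 means "retry without limit", mirroring the check in the diff.
    retries_left = max_retries == -1 or retries < max_retries
    return quota_exceeded and retries_left
```

Written this way, any status code other than a quota-related 403 falls through to the failure path by default, which is the behaviour the comment asks for.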



##########
airflow/providers/cncf/kubernetes/provider.yaml:
##########
@@ -350,6 +350,15 @@ config:
         type: string
         example: ~
         default: ""
+      task_publish_max_retries:
+        description: |
+          The Maximum number of retries for queuing the task to the kubernetes scheduler when
+          failing due to Kube API transient errors before giving up and marking task as failed.

Review Comment:
   We don't need to retry on every error, only on the exceeded-quota error. 
That is another +1 for applying my comment above and for renaming the config.
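
As a rough illustration of retrying only on the exceeded-quota error, the sketch below tracks a per-task retry counter. Everything here is hypothetical: `QuotaExceeded` stands in for a 403 "exceeded quota" response, and `publish_once` is not an Airflow function. Any other exception simply propagates, so the caller fails the task.

```python
from collections import defaultdict


class QuotaExceeded(Exception):
    """Hypothetical stand-in for a 403 'exceeded quota' API response."""


def publish_once(key, publish, max_retries, counters):
    """Attempt one publish; return 'queued', 'retry', or 'failed'.

    Only QuotaExceeded is retried. Other exceptions are not caught here.
    """
    try:
        publish(key)
    except QuotaExceeded:
        # -1 means unlimited retries, as in the config under review.
        if max_retries == -1 or counters[key] < max_retries:
            counters[key] += 1
            return "retry"
        counters.pop(key, None)
        return "failed"
    counters.pop(key, None)
    return "queued"


def always_over_quota(key):
    raise QuotaExceeded(key)


counters = defaultdict(int)
results = [publish_once("task-1", always_over_quota, 2, counters) for _ in range(3)]
```

With a limit of 2, the first two attempts are re-queued and the third is marked as failed.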


