Nataneljpwd commented on code in PR #61778:
URL: https://github.com/apache/airflow/pull/61778#discussion_r2802882809


##########
providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/triggers/pod.py:
##########
@@ -183,7 +184,7 @@ async def run(self) -> AsyncIterator[TriggerEvent]:
                 event = await self._wait_for_container_completion()
             yield event
             return
-        except PodLaunchTimeoutException as e:
+        except (PodLaunchTimeoutException, PodLaunchFailedException) as e:

Review Comment:
   I think there should be some kind of a hard limit for the tries when pulling 
the image, as what if the image was deleted? we do not want to keep retrying 
forever, and as of now, it looks like if we have a corrupt image, we will retry 
indefinitely.
   That is, at least, what I see that will happen now.



##########
providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/pod.py:
##########
@@ -958,6 +959,12 @@ def trigger_reentry(self, context: Context, event: 
dict[str, Any]) -> Any:
             follow = self.logging_interval is None
             last_log_time = event.get("last_log_time")
 
+            if event["status"] == "timeout":
+                pod_phase = self.pod.status.phase if self.pod.status and 
self.pod.status.phase else None
+                if pod_phase in {PodPhase.RUNNING, *PodPhase.terminal_states}:
+                    self.log.info("Pod has transitioned from pending state 
after timeout, deferring again")
+                    self.invoke_defer_method(last_log_time=last_log_time, 
context=context)

Review Comment:
   Great!
   you may resolve the comment



##########
providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/pod.py:
##########
@@ -946,6 +946,7 @@ def trigger_reentry(self, context: Context, event: 
dict[str, Any]) -> Any:
         """
         self.pod = None
         xcom_sidecar_output = None
+        skip_cleanup = False

Review Comment:
   If we skip cleanup only when the task is either deferred or when the status 
is a timeout, I think we can just check that again instead of adding a flag, it 
will make the code easier to follow in my opinion, what do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to