johnhoran commented on code in PR #61778:
URL: https://github.com/apache/airflow/pull/61778#discussion_r2795795782
##########
providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/triggers/pod.py:
##########
@@ -183,7 +184,7 @@ async def run(self) -> AsyncIterator[TriggerEvent]:
event = await self._wait_for_container_completion()
yield event
return
- except PodLaunchTimeoutException as e:
+ except (PodLaunchTimeoutException, PodLaunchFailedException) as e:
Review Comment:
I thought about doing that, but I don't think it fully fixes the problem.
There is always going to be a window while control is handed back from the
triggerer to the operator, and for any of the transient reasons the pod failed,
such as failing to pull the image or a node not spinning up in time, there is
the possibility that the pod will have started by the time the operator takes
over.
The other reason I felt it was better not to add extra logic looking for
rate limiting is that
https://github.com/apache/airflow/blob/1c41180381a459b77b6d964229bdc19a4a7ec0b3/providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py#L187-L205
checks for both `ErrImagePull` and `ImagePullBackOff`. Depending on how long it
takes the triggerer to take over, I might only see one of those states, and
`ImagePullBackOff` might not carry as verbose a reason as `ErrImagePull`. If
you feel strongly it's worth doing, maybe I could look at pulling Kubernetes
events, roughly as sketched below...
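To make the events idea concrete, something like this is roughly what I have in
mind. It is only a sketch: the `core_v1` client, pod name and namespace are
placeholders for whatever the trigger already has available from its hook/state.

```python
# Rough sketch only: surface recent pod events (e.g. image pull throttling) so the
# trigger could include them in the failure message. `core_v1`, `name` and `namespace`
# are placeholders, not existing trigger attributes.
from kubernetes_asyncio import client


async def fetch_pod_events(core_v1: client.CoreV1Api, name: str, namespace: str) -> list[str]:
    """Return human-readable event messages for a single pod."""
    event_list = await core_v1.list_namespaced_event(
        namespace=namespace,
        field_selector=f"involvedObject.name={name},involvedObject.kind=Pod",
    )
    return [f"{event.reason}: {event.message}" for event in event_list.items]
```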
Anyway, I moved the PR back to draft because I think I also need to account
for the pod starting and then terminating while waiting for the operator to
resume.
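For that case, the kind of check I have in mind on the operator side is
something like the following; again purely illustrative, assuming the operator
re-reads the pod when it resumes.

```python
# Illustrative only, not the actual fix: detect a pod that started and already
# reached a terminal phase during the triggerer-to-operator handoff.
from kubernetes.client.models import V1Pod


def pod_reached_terminal_state(pod: V1Pod) -> bool:
    """True if the pod ran to completion (or failed) while the handoff was in flight."""
    return pod.status is not None and pod.status.phase in ("Succeeded", "Failed")
```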