amoghrajesh commented on code in PR #68067:
URL: https://github.com/apache/airflow/pull/68067#discussion_r3393302671
##########
providers/apache/spark/src/airflow/providers/apache/spark/operators/spark_submit.py:
##########
@@ -397,8 +441,19 @@ def poll_until_complete(self, external_id: JsonValue, context: Context) -> None:
self._hook._run_post_submit_commands()
return
if self._hook._is_kubernetes:
- # TODO: poll K8s pod phase until terminal
- raise NotImplementedError("K8s poll not yet implemented")
+ if external_id is not None:
+ _, pod_name = str(external_id).split(":", 1)
+ self._hook._kubernetes_driver_pod = pod_name
+ self._hook._poll_k8s_driver_via_api()
+ # The driver pod is deleted on success, so cache the terminal phase before it
+ # disappears. Failed jobs raise before reaching here, so only "Succeeded" is ever
+ # cached. A missing key on retry means the pod was garbage collected after failure, and
+ # resubmitting fresh is the right behaviour in that case.
+ task_store = context.get("task_store")
+ if task_store is not None:
+ task_store.set(self._K8S_DRIVER_STATUS_KEY, "Succeeded")
Review Comment:
Just checked and you are right. We have a comment at the call site which
says "Failed jobs raise before reaching here, so only 'Succeeded' is ever
cached", but that misses the 404 path, which also returns without raising (when
on_kill deletes the pod mid-run). That would incorrectly cache "Succeeded" for
a killed job and cause the next retry to skip resubmission entirely.
I am changing `_poll_k8s_driver_via_api()` to return the terminal phase
(`str | None`, with `None` on the 404 path), and the call site will only write
to task_store when the return value is `"Succeeded"`.
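
A minimal, self-contained sketch of that guard, for illustration only. It does not use the real hook or task store: `poll_driver_phase`, `poll_and_cache`, `InMemoryTaskStore`, `DriverFailedError`, and the key value are all hypothetical stand-ins, and the polling loop is simulated with a list of observed phases where `None` represents a 404.

```python
from typing import Dict, List, Optional

# Hypothetical key; the real value of _K8S_DRIVER_STATUS_KEY is not shown in the diff.
K8S_DRIVER_STATUS_KEY = "k8s_driver_status"


class InMemoryTaskStore:
    """Hypothetical stand-in for the task store, exposing only set()."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self.data[key] = value


class DriverFailedError(RuntimeError):
    """Raised when the simulated driver pod reaches the Failed phase."""


def poll_driver_phase(observations: List[Optional[str]]) -> Optional[str]:
    """Simulate polling the driver pod.

    Each item is an observed pod phase; None stands for a 404 (pod deleted,
    for instance by on_kill). Returns "Succeeded" on success, None on the
    404 path, and raises on failure.
    """
    for phase in observations:
        if phase is None:
            return None  # pod is gone; no terminal phase was observed
        if phase == "Failed":
            raise DriverFailedError("driver pod failed")
        if phase == "Succeeded":
            return phase
    return None  # ran out of observations without a terminal phase


def poll_and_cache(
    observations: List[Optional[str]],
    task_store: Optional[InMemoryTaskStore],
) -> Optional[str]:
    """Cache the status only when the poll actually reported success."""
    phase = poll_driver_phase(observations)
    if task_store is not None and phase == "Succeeded":
        task_store.set(K8S_DRIVER_STATUS_KEY, phase)
    return phase
```

With this shape, a pod deleted mid-run leaves the store untouched, so a retry would find no cached key and resubmit.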