amoghrajesh opened a new issue, #67934:
URL: https://github.com/apache/airflow/issues/67934

   ## Problem
   
   `SparkSubmitOperator` with `track_driver_via_k8s_api=True` detects job 
completion by watching `pod.status.phase`. This breaks in two ways when the 
driver pod has sidecar containers:
   
   1. The pod phase stays `Running` after the driver container exits (because 
the sidecar is still alive), so the poll loop never sees `Succeeded` and waits 
indefinitely.
   2. On the `Failed` branch, `container_statuses[0]` is used to extract the 
exit code and reason — but index 0 is not guaranteed to be the driver container 
in a multi-container pod.
   
   ## When it occurs
   
   Only when Istio (or another sidecar container, such as Fluent Bit) is 
injected into the **driver pod**. 
   
   ## Proposed fix
   
   Filter `container_statuses` by the driver container name 
(`spark-kubernetes-driver` is the Spark default) instead of relying on 
`pod.status.phase` or index 0:
   
   - Treat the driver container's `state.terminated.exit_code == 0` as success.
   - Treat `exit_code != 0` as failure, with the actual exit code and reason in 
the error message.
   - Fall back to `pod.status.phase` if the container name is not found 
(defensive).
   
   The driver container name could be made configurable via a new 
`k8s_driver_container_name` parameter defaulting to `spark-kubernetes-driver`.
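
A minimal sketch of the container-level check described above, not the provider's actual code. The function name `driver_outcome`, its `(state, detail)` return shape, and the `fake_status` helper are hypothetical and exist only for illustration. The attribute names (`status.container_statuses`, `state.terminated.exit_code`, `reason`) are the ones the Kubernetes Python client exposes on a pod object.

```python
from types import SimpleNamespace

# Spark's default driver container name, per the proposal above.
DEFAULT_DRIVER_CONTAINER = "spark-kubernetes-driver"


def driver_outcome(pod, container_name=DEFAULT_DRIVER_CONTAINER):
    """Return (state, detail); state is 'succeeded', 'failed', or 'running'.

    Hypothetical helper: looks the driver container up by name instead of
    trusting pod.status.phase or container_statuses[0].
    """
    statuses = pod.status.container_statuses or []
    driver = next((s for s in statuses if s.name == container_name), None)

    if driver is None:
        # Defensive fallback: container name not found, use the pod phase.
        phase = pod.status.phase
        if phase == "Succeeded":
            return ("succeeded", None)
        if phase == "Failed":
            return ("failed", "pod phase is Failed")
        return ("running", None)

    terminated = driver.state.terminated if driver.state else None
    if terminated is None:
        # Driver container has not exited yet.
        return ("running", None)
    if terminated.exit_code == 0:
        return ("succeeded", None)
    return (
        "failed",
        f"exit code {terminated.exit_code}, reason: {terminated.reason}",
    )


def fake_status(name, exit_code=None, reason=None):
    """Build a stand-in for a container status (illustration only)."""
    terminated = None
    if exit_code is not None:
        terminated = SimpleNamespace(exit_code=exit_code, reason=reason)
    return SimpleNamespace(name=name, state=SimpleNamespace(terminated=terminated))


# Sidecar still alive, so the pod phase stays Running, yet the driver is done.
pod = SimpleNamespace(
    status=SimpleNamespace(
        phase="Running",
        container_statuses=[
            fake_status("istio-proxy"),
            fake_status(DEFAULT_DRIVER_CONTAINER, exit_code=0, reason="Completed"),
        ],
    )
)
result = driver_outcome(pod)
```

Here the sidecar sits at index 0 and the pod phase is still `Running`, but the lookup by name reports the driver as succeeded. A poll loop would call such a helper on each iteration and stop as soon as the state is no longer `running`.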
   
   ## Workaround
   
   Set `execution_timeout` on the operator so the task times out instead of 
waiting indefinitely. This is documented in the requirements section of the 
operator docs.
   
   ## Related
   
   Introduced in: #67715

