Re: [PR] Add `ResumableJobMixin` with `SparkSubmitOperator` as a case study for surviving worker failures (standalone) [airflow]

via GitHub Wed, 27 May 2026 22:36:49 -0700


shahar1 commented on code in PR #67118:
URL: https://github.com/apache/airflow/pull/67118#discussion_r3315548832



##########
providers/apache/spark/docs/operators.rst:
##########
@@ -181,3 +181,24 @@ Reference
 """""""""
 
 For further information, look at `Apache Spark submitting applications <https://spark.apache.org/docs/latest/submitting-applications.html>`_.
+
+Cluster mode crash recovery (Spark standalone)
+"""""""""""""""""""""""""""""""""""""""""""""""
+
+When running in Spark standalone cluster mode (``--deploy-mode cluster``), the Spark driver runs
+independently on the cluster. If the Airflow worker dies while the Spark job is running, the driver keeps running but
+Airflow loses track of it and the behaviour to submit a brand new job would be wasting
+the compute already done.
+
+Now, the ``SparkSubmitOperator`` solves this by persisting the driver ID to ``task_state`` immediately after
+submission. On retry, it reads the ID back and reconnects to the already-running driver instead of
+resubmitting.
+
+This is the **synchronous path** — the worker holds a slot for the duration of polling. This is
+intentional for teams that prefer sync operators for log observability, org constraints, or
+because a Triggerer is not available. It is not a replacement for deferrable operators; the two
+approaches are complementary.

Review Comment:
   > Resolving for now, can revisit if disagreement arises
   
   Sorry for the delay, can live with it :)
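
For anyone reading this thread without the PR open, the persist-then-reconnect flow described in the doc hunk above can be sketched roughly as follows. This is only an illustration, not the PR's implementation: `run_resumable`, `DRIVER_ID_KEY`, and the `submit` / `poll_until_done` callables are hypothetical names, and `task_state` is modelled as a plain dict standing in for whatever persistent store the operator actually uses.

```python
from typing import Callable, MutableMapping

# Hypothetical key name; the real operator may store the ID differently.
DRIVER_ID_KEY = "spark_driver_id"


def run_resumable(
    task_state: MutableMapping[str, str],
    submit: Callable[[], str],
    poll_until_done: Callable[[str], str],
) -> str:
    """Submit once, persist the driver ID, and reconnect on later attempts."""
    driver_id = task_state.get(DRIVER_ID_KEY)
    if driver_id is None:
        # First attempt: submit and persist the ID before polling starts,
        # so a worker crash during polling cannot lose it.
        driver_id = submit()
        task_state[DRIVER_ID_KEY] = driver_id
    # Retry attempts skip submission and poll the already-running driver.
    return poll_until_done(driver_id)


# Simulated run: the first attempt "crashes" while polling, the retry reconnects.
submissions = []


def fake_submit() -> str:
    driver_id = f"driver-{len(submissions) + 1}"
    submissions.append(driver_id)
    return driver_id


def crashing_poll(driver_id: str) -> str:
    raise RuntimeError("worker died while polling")


def healthy_poll(driver_id: str) -> str:
    return "FINISHED"


state: dict = {}
try:
    run_resumable(state, fake_submit, crashing_poll)
except RuntimeError:
    pass  # stands in for the scheduler retrying the task

final_status = run_resumable(state, fake_submit, healthy_poll)
```

In this simulation the job is submitted only once across both attempts, which is the behaviour the new doc section describes. The polling call still blocks the caller, matching the synchronous path discussed in the hunk.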



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.