shahar1 commented on code in PR #67118:
URL: https://github.com/apache/airflow/pull/67118#discussion_r3315548832
##########
providers/apache/spark/docs/operators.rst:
##########
@@ -181,3 +181,24 @@ Reference
 """""""""
 
 For further information, look at `Apache Spark submitting applications <https://spark.apache.org/docs/latest/submitting-applications.html>`_.
+
+Cluster mode crash recovery (Spark standalone)
+"""""""""""""""""""""""""""""""""""""""""""""""
+
+When running in Spark standalone cluster mode (``--deploy-mode cluster``), the Spark driver runs
+independently on the cluster. If the Airflow worker dies while the Spark job is running, the driver keeps running but
+Airflow loses track of it and the behaviour to submit a brand new job would be wasting
+the compute already done.
+
+Now, the ``SparkSubmitOperator`` solves this by persisting the driver ID to ``task_state`` immediately after
+submission. On retry, it reads the ID back and reconnects to the already-running driver instead of
+resubmitting.
+
+This is the **synchronous path** — the worker holds a slot for the duration of polling. This is
+intentional for teams that prefer sync operators for log observability, org constraints, or
+because a Triggerer is not available. It is not a replacement for deferrable operators; the two
+approaches are complementary.

Review Comment:
   > Resolving for now, can revisit if disagreement arises

   Sorry for the delay, can live with it :)
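
For readers following the thread, the persist-then-reconnect flow that the quoted docs hunk describes can be sketched roughly as below. This is a toy, self-contained illustration and not the operator's actual code: `FakeStandaloneCluster`, `run_attempt`, and the plain dict standing in for ``task_state`` are all hypothetical names invented here, and no real Airflow or Spark API is called.

```python
from dataclasses import dataclass, field


@dataclass
class FakeStandaloneCluster:
    """Hypothetical stand-in for a Spark standalone master (not a real Spark API)."""

    drivers: dict = field(default_factory=dict)
    submissions: int = 0

    def submit(self, app: str) -> str:
        # Each submission starts a new driver and returns its ID.
        self.submissions += 1
        driver_id = f"driver-{self.submissions:04d}"
        self.drivers[driver_id] = "RUNNING"
        return driver_id


def run_attempt(cluster: FakeStandaloneCluster, task_state: dict, app: str) -> str:
    """One task attempt: reuse a persisted driver ID if present, else submit."""
    driver_id = task_state.get("driver_id")
    if driver_id is None:
        driver_id = cluster.submit(app)
        # Persist immediately after submission, before any polling begins,
        # so a worker crash during polling does not lose the ID.
        task_state["driver_id"] = driver_id
    return driver_id


cluster = FakeStandaloneCluster()
task_state = {}  # assumed to survive across retries of the same task

first = run_attempt(cluster, task_state, "etl.py")
# ... the worker dies here while polling; the driver keeps running ...
second = run_attempt(cluster, task_state, "etl.py")  # the retry
```

In this sketch the retry returns the same driver ID as the first attempt and the fake cluster records a single submission, which is the behaviour the docs paragraph claims for the synchronous path.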
