Re: [PR] Add `ResumableJobMixin` with `SparkSubmitOperator` as a case study for surviving worker failures [airflow]

via GitHub Tue, 19 May 2026 02:30:13 -0700


Alexhans commented on code in PR #67118:
URL: https://github.com/apache/airflow/pull/67118#discussion_r3265184734



##########
providers/apache/spark/src/airflow/providers/apache/spark/operators/spark_submit.py:
##########
@@ -198,8 +221,63 @@ def execute(self, context: Context) -> None:
             self.conf = inject_transport_information_into_spark_properties(self.conf, context)
         if self._hook is None:
             self._hook = self._get_hook()
+        if self._hook._should_track_driver_status:
+            return self.execute_resumable(context)

Review Comment:
   +1 on the approach. We've used custom deferrable sensors extensively, but we 
run GlueJobOperator synchronously in production. We explored deferrable mode for 
compute operators but didn't fully adopt it (triggerer bottlenecks, and we didn't 
dig deep enough into the reliability concerns). Either way, the worker crash 
recovery problem is orthogonal to deferrable, as you point out.
    
   Glue seems like a great case to test this on as well, since it recently added 
[resume_glue_job_on_retry](https://github.com/apache/airflow/pull/59392), which 
aims to solve the same problem via XCom, with scanning all job runs as a fallback. 
I haven't tried it yet but plan to shortly. Is the intent for operators 
like Glue to eventually adopt ResumableJobMixin and replace that ad-hoc 
mechanism?
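
   For concreteness, here is a rough sketch of the "stored id first, scan as 
fallback" pattern as I understand it. This is not the actual GlueJobOperator or 
ResumableJobMixin code: `find_run_to_resume`, `get_state`, `list_runs`, and the 
state names are all hypothetical stand-ins, and the stored id is assumed to come 
from something like XCom.

```python
from typing import Callable, Iterable, Optional

# Hypothetical state names; a real backend defines its own.
ACTIVE_STATES = {"STARTING", "RUNNING"}


def find_run_to_resume(
    stored_run_id: Optional[str],
    get_state: Callable[[str], Optional[str]],
    list_runs: Callable[[], Iterable[dict]],
) -> Optional[str]:
    """Return the id of a still-active run to re-attach to, or None to submit anew."""
    # Preferred path: the run id persisted by the previous attempt (assumed: via XCom).
    if stored_run_id is not None and get_state(stored_run_id) in ACTIVE_STATES:
        return stored_run_id
    # Fallback: scan every known run and pick the first one that is still active.
    for run in list_runs():
        if run["state"] in ACTIVE_STATES:
            return run["id"]
    # Nothing to resume; the caller should submit a fresh job.
    return None


# Illustrative data only: the previous attempt's run is still going.
runs = [{"id": "jr_1", "state": "FAILED"}, {"id": "jr_2", "state": "RUNNING"}]
states = {r["id"]: r["state"] for r in runs}
resumed = find_run_to_resume("jr_2", states.get, lambda: runs)
```

   A shared mixin would presumably own this lookup so that each operator only 
supplies the backend-specific "get state" and "list runs" calls.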
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
