Alexhans commented on code in PR #67118:
URL: https://github.com/apache/airflow/pull/67118#discussion_r3265184734
##########
providers/apache/spark/src/airflow/providers/apache/spark/operators/spark_submit.py:
##########
@@ -198,8 +221,63 @@ def execute(self, context: Context) -> None:
             self.conf = inject_transport_information_into_spark_properties(self.conf, context)
         if self._hook is None:
             self._hook = self._get_hook()
+        if self._hook._should_track_driver_status:
+            return self.execute_resumable(context)
Review Comment:
+1 on the approach. We've used custom deferrable sensors extensively, but
run GlueJobOperator synchronously in production. We explored deferrable for
compute operators but didn't fully adopt it (we hit triggerer bottlenecks and
didn't dig deep enough into the reliability concerns). Either way, the worker crash recovery
problem is orthogonal to deferrable, as you point out.
Glue seems like a great case to test this as well, since it recently added
[resume_glue_job_on_retry](https://github.com/apache/airflow/pull/59392) which
aims to solve the same problem via XCom + scanning all job runs as fallback.
Haven't tried it yet but plan to shortly. Would the intent be for operators
like Glue to eventually adopt ResumableJobMixin and replace that ad-hoc
mechanism?
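
To make the comparison concrete, here is a rough sketch of the resume-on-retry pattern as I understand it: check a run ID saved by the previous attempt first, scan all job runs as a fallback, and only then start a new run. Everything in it is hypothetical. `FakeJobClient`, `resume_or_start`, and the plain dict standing in for XCom are illustrative names only, not the Glue operator's or ResumableJobMixin's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class FakeJobClient:
    """Hypothetical stand-in for a remote job service (not a real Glue client)."""

    runs: dict = field(default_factory=dict)  # run_id -> state string
    _counter: int = 0

    def start(self) -> str:
        # Create a new run and mark it as active.
        self._counter += 1
        run_id = f"run-{self._counter}"
        self.runs[run_id] = "RUNNING"
        return run_id

    def state(self, run_id: str):
        # Returns None for unknown run IDs.
        return self.runs.get(run_id)

    def list_runs(self):
        return list(self.runs.items())


ACTIVE_STATES = {"RUNNING"}  # assumption: one active state is enough for the sketch


def resume_or_start(client, store, key="job_run_id"):
    """Return (run_id, how), reusing an in-flight run when one can be found."""
    # 1. Prefer the run ID persisted by a previous attempt (XCom in the real case).
    saved = store.get(key)
    if saved is not None and client.state(saved) in ACTIVE_STATES:
        return saved, "resumed_from_store"

    # 2. Fallback: the saved ID is missing or stale, so scan all runs.
    for run_id, state in client.list_runs():
        if state in ACTIVE_STATES:
            store[key] = run_id
            return run_id, "resumed_from_scan"

    # 3. Nothing to resume: start a fresh run and persist its ID.
    run_id = client.start()
    store[key] = run_id
    return run_id, "started"
```

The scan in step 2 is the part that looks ad hoc to me: it has to guess which active run belongs to this task, which is the kind of logic a shared mixin could own once instead of each operator reimplementing it.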