Re: [PR] feat: cancel running stages and tasks on job failure or cancellation [datafusion-ballista]

via GitHub Fri, 26 Jun 2026 09:56:05 -0700


andygrove commented on code in PR #1903:
URL: 
https://github.com/apache/datafusion-ballista/pull/1903#discussion_r3482883527



##########
ballista/scheduler/src/state/execution_graph.rs:
##########
@@ -219,6 +219,11 @@ pub trait ExecutionGraph: Debug {
     /// fail job with error message
     fn fail_job(&mut self, error: String);
 
+    /// Abort a running job: fail it, transition every running stage to Failed,
+    /// and return the in-flight tasks that should be cancelled. Used for both 
the
+    /// failure and cancellation teardown paths.
+    fn abort_running(&mut self, error: String) -> Vec<RunningTaskInfo>;

Review Comment:
   On your open question about the default trait method versus two per-impl 
copies: I'd lean toward the default method. The body only calls 
`running_tasks`, `fail_job`, `running_stages`, and `fail_stage`, which are all 
methods on this trait, so a default implementation here should compile and let 
both `StaticExecutionGraph` and `AdaptiveExecutionGraph` drop their identical 
copies. That keeps the teardown logic in one place so the two graph types can't 
drift apart later. The lack of an existing default method in the trait is fine, 
this seems like a reasonable first one. Happy either way though if you'd rather 
keep them explicit.



##########
ballista/scheduler/src/state/execution_stage.rs:
##########
@@ -554,15 +554,40 @@ impl RunningStage {
         }
     }
 
-    /// Converts this running stage to a failed stage with the given error 
message.
+    /// Converts this running stage to a failed stage. Still-running tasks are 
recorded
+    /// as cancelled (`Failed(TaskKilled)`) since failing the stage cancels 
them.
     pub fn to_failed(&self, error_message: String) -> FailedStage {
+        let task_infos = self
+            .task_infos
+            .iter()
+            .map(|task_info| {
+                task_info.as_ref().map(|info| {
+                    if matches!(info.task_status, 
task_status::Status::Running(_)) {
+                        TaskInfo {
+                            task_status: 
task_status::Status::Failed(FailedTask {
+                                error: "cancelled".to_string(),

Review Comment:
   Small thing. Since `abort_running` is shared between the failure and 
cancellation paths, a task killed because the job *failed* (rather than being 
cancelled by a user) will still get `error: "cancelled"`. The `TaskKilled` 
reason is accurate either way, so this is minor, but would it be worth 
threading the actual reason through, or using something more neutral like 
"killed"? Not a blocker, just a clarity thought for anyone reading task 
statuses later.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: cancel running stages and tasks on job failure or cancellation [datafusion-ballista]

Reply via email to