andygrove commented on code in PR #1903:
URL:
https://github.com/apache/datafusion-ballista/pull/1903#discussion_r3482883527
##########
ballista/scheduler/src/state/execution_graph.rs:
##########
@@ -219,6 +219,11 @@ pub trait ExecutionGraph: Debug {
/// fail job with error message
fn fail_job(&mut self, error: String);
+ /// Abort a running job: fail it, transition every running stage to Failed,
+ /// and return the in-flight tasks that should be cancelled. Used for both
the
+ /// failure and cancellation teardown paths.
+ fn abort_running(&mut self, error: String) -> Vec<RunningTaskInfo>;
Review Comment:
On your open question about the default trait method versus two per-impl
copies: I'd lean toward the default method. The body only calls
`running_tasks`, `fail_job`, `running_stages`, and `fail_stage`, which are all
methods on this trait, so a default implementation here should compile and let
both `StaticExecutionGraph` and `AdaptiveExecutionGraph` drop their identical
copies. That keeps the teardown logic in one place so the two graph types can't
drift apart later. The lack of an existing default method in the trait is fine,
this seems like a reasonable first one. Happy either way though if you'd rather
keep them explicit.
##########
ballista/scheduler/src/state/execution_stage.rs:
##########
@@ -554,15 +554,40 @@ impl RunningStage {
}
}
- /// Converts this running stage to a failed stage with the given error
message.
+ /// Converts this running stage to a failed stage. Still-running tasks are
recorded
+ /// as cancelled (`Failed(TaskKilled)`) since failing the stage cancels
them.
pub fn to_failed(&self, error_message: String) -> FailedStage {
+ let task_infos = self
+ .task_infos
+ .iter()
+ .map(|task_info| {
+ task_info.as_ref().map(|info| {
+ if matches!(info.task_status,
task_status::Status::Running(_)) {
+ TaskInfo {
+ task_status:
task_status::Status::Failed(FailedTask {
+ error: "cancelled".to_string(),
Review Comment:
Small thing. Since `abort_running` is shared between the failure and
cancellation paths, a task killed because the job *failed* (rather than being
cancelled by a user) will still get `error: "cancelled"`. The `TaskKilled`
reason is accurate either way, so this is minor, but would it be worth
threading the actual reason through, or using something more neutral like
"killed"? Not a blocker, just a clarity thought for anyone reading task
statuses later.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]