killzoner opened a new pull request, #1903:
URL: https://github.com/apache/datafusion-ballista/pull/1903

   # Which issue does this PR close?
   
   Closes https://github.com/apache/datafusion-ballista/issues/1795.
   
   # Rationale for this change
   
   Previously, when a job failed or was cancelled, only the job status was set 
to `Failed`. Its running stages and tasks stayed stuck in `Running` because 
`fail_stage` was never called.
   
   # What changes are included in this PR?
   
   - Add `ExecutionGraph::abort_running`, called from `task_manager::abort_job` 
(used by both the failure and cancel paths): it fails the job and runs 
`fail_stage` over the running stages.
   - `RunningStage::to_failed` marks still-running tasks as 
`Failed(TaskKilled)`.
   - Tests for both static and adaptive graphs.
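   
   As a rough illustration of the state transitions listed above, here is a 
   minimal sketch on a simplified, hypothetical model. The `Graph`, `Stage`, 
   `TaskStatus`, `StageState` and `JobStatus` types and every signature below 
   are assumptions made for this example, not Ballista's actual definitions; 
   only the names `abort_running`, `to_failed` and `TaskKilled` come from the PR.

   ```rust
   // Simplified, hypothetical model: NOT Ballista's real types or signatures.
   #[derive(Debug, Clone, PartialEq)]
   enum TaskStatus {
       Running,
       Successful,
       Failed(String), // failure reason, e.g. "TaskKilled"
   }

   #[derive(Debug, Clone, PartialEq)]
   enum StageState {
       Running,
       Successful,
       Failed,
   }

   #[derive(Debug, PartialEq)]
   enum JobStatus {
       Running,
       Failed(String),
   }

   #[derive(Debug)]
   struct Stage {
       state: StageState,
       tasks: Vec<TaskStatus>,
   }

   struct Graph {
       status: JobStatus,
       stages: Vec<Stage>,
   }

   impl Stage {
       // Stand-in for `RunningStage::to_failed`: tasks that are still running
       // become `Failed("TaskKilled")`; finished tasks keep their status.
       fn to_failed(&mut self) {
           for task in self.tasks.iter_mut() {
               if *task == TaskStatus::Running {
                   *task = TaskStatus::Failed("TaskKilled".to_string());
               }
           }
           self.state = StageState::Failed;
       }
   }

   impl Graph {
       // Stand-in for `ExecutionGraph::abort_running`: fail the job, then fail
       // every stage that is still running. Returns how many stages were failed.
       fn abort_running(&mut self, reason: &str) -> usize {
           self.status = JobStatus::Failed(reason.to_string());
           let mut failed = 0;
           for stage in self
               .stages
               .iter_mut()
               .filter(|s| s.state == StageState::Running)
           {
               stage.to_failed();
               failed += 1;
           }
           failed
       }
   }
   ```

   In this model, aborting a graph with one successful stage and one running 
   stage leaves the successful stage untouched and fails only the running one, 
   including its still-running tasks.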
   
   `CancelTasks` to executors stays best-effort (no protocol change).
   
   The `refactor: implement abort_running per ExecutionGraph impl` commit can 
be dropped if you prefer a default trait method over two per-impl copies.
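   
   For reviewers weighing the two options, this is roughly what the default 
   trait method alternative could look like. It is a sketch under assumptions: 
   the trait below is a toy with made-up required methods (`fail_job`, 
   `running_stages`, and this `fail_stage` signature are hypothetical), and 
   `ToyGraph` exists only to exercise it.

   ```rust
   use std::collections::BTreeMap;

   // Toy trait: NOT the real `ExecutionGraph` trait; signatures are assumptions.
   trait ExecutionGraph {
       fn fail_job(&mut self, error: String);
       fn running_stages(&self) -> Vec<usize>;
       // Returns true if the stage was running and has now been failed.
       fn fail_stage(&mut self, stage_id: usize, error: String) -> bool;

       // Default method: written once, shared by every implementor unless an
       // implementor chooses to override it.
       fn abort_running(&mut self, error: String) -> usize {
           self.fail_job(error.clone());
           let mut failed = 0;
           for stage_id in self.running_stages() {
               if self.fail_stage(stage_id, error.clone()) {
                   failed += 1;
               }
           }
           failed
       }
   }

   // Hypothetical implementor used only for illustration.
   struct ToyGraph {
       job_error: Option<String>,
       stages: BTreeMap<usize, bool>, // stage id -> still running?
   }

   impl ExecutionGraph for ToyGraph {
       fn fail_job(&mut self, error: String) {
           self.job_error = Some(error);
       }

       fn running_stages(&self) -> Vec<usize> {
           self.stages
               .iter()
               .filter(|(_, running)| **running)
               .map(|(id, _)| *id)
               .collect()
       }

       fn fail_stage(&mut self, stage_id: usize, _error: String) -> bool {
           match self.stages.get_mut(&stage_id) {
               Some(running) if *running => {
                   *running = false;
                   true
               }
               _ => false,
           }
       }
   }
   ```

   The trade-off is the usual one: a default method avoids duplicated logic, 
   while per-impl copies let each graph access its own fields directly without 
   widening the trait's required surface.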
   
   Tested locally with TPC-H under the chaos fault generator:
   
   ```bash
   cargo run --bin ballista-scheduler --features prometheus-metrics
   cargo run --bin ballista-executor
   cargo run --bin tpch -- benchmark ballista -p /path/to/tpch/sf10 -f parquet -i 1 \
     --port 50050 --host 127.0.0.1 \
     -c datafusion.execution.target_partitions=24 \
     -c ballista.planner.adaptive.enabled=true \
     -c ballista.testing.chaos_execution.enabled=true \
     -c ballista.testing.chaos_execution.fault_type=fatal \
     -c ballista.testing.chaos_execution.probability=0.6 \
     -q 1
   ```
   
   Failed job in the TUI:
   
   **Global view** (job shows `Failed`):
   <!-- drop screenshot here: Screenshot from 2026-06-26 17-03-39.png -->
   
   **Stage detail** (stage now shows `Failed`, was stuck `Running`):
   <!-- drop screenshot here: Screenshot from 2026-06-26 17-04-07.png -->
   
   **Stage tasks** (still empty for a failed stage; the follow-up PR surfaces them):
   <!-- drop screenshot here: Screenshot from 2026-06-26 17-04-19.png -->
   
   # Are there any user-facing changes?
   
   Failed/cancelled jobs now show their stages as `Failed` instead of stuck 
`Running`.
   
   Follow-up: a stacked PR surfaces a failed stage's tasks in the REST API and 
TUI and distinguishes cancelled from failed (the task list is empty today, see 
last screenshot).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

