KR-bluejay opened a new issue, #1316:
URL: https://github.com/apache/datafusion-ballista/issues/1316
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
The current job-data deletion flow has several areas that could be improved:
1. *Duplication*
The executor has `clean_all_shuffle_data` alongside other ad-hoc removal
logic. These overlap in functionality, making the code harder to maintain and
reason about.
2. *Push-based broadcast*
When the scheduler initiates cleanup, it currently notifies all
executors. This is inefficient because only a subset of executors actually hold
the job’s data.
3. *Per-job deletion tasks*
In `clean_up_successful_job` / `clean_up_failed_job`, the scheduler
spawns a separate delayed task (`sleep`) for each job and calls
`state.remove_job(job_id)` individually. This results in many small tasks and
RPCs, which could be batched more efficiently.
**Describe the solution you'd like**
Unify cleanup behind a single, testable “deletion facility”:
1. *Deduplicate* logic with `clean_all_shuffle_data`; extract/keep a shared
async remover (e.g., `remove_job_dir`) with safety checks.
2. *Targeted push*: notify only executors that actually hold the job’s data
(no broadcast).
3. *Batching*: we already dispatch periodically; change each tick to send
one batched `remove_jobs(Vec<JobId>)` for all pending IDs rather than spawning
per-job sleeps and individual removals.
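Points 2 and 3 can be sketched together as a small piece of scheduler-side bookkeeping. This is only an illustration under assumptions, not Ballista's actual code: `PendingCleanup`, `record_location`, `schedule`, and `drain_batches` are hypothetical names, and `JobId` / `ExecutorId` are plain `String` aliases here. The idea is that each periodic tick drains the pending set once and yields one batch per executor that is known to hold data, which the caller could then send as a single `remove_jobs(Vec<JobId>)` request.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical aliases for illustration; the real Ballista types may differ.
type JobId = String;
type ExecutorId = String;

/// Hypothetical scheduler-side state: where each job's data lives,
/// and which jobs are waiting to be deleted on the next tick.
#[derive(Default)]
struct PendingCleanup {
    /// job -> executors known to hold that job's shuffle data
    locations: BTreeMap<JobId, BTreeSet<ExecutorId>>,
    /// jobs queued for deletion since the last tick
    pending: BTreeSet<JobId>,
}

impl PendingCleanup {
    /// Remember that `executor` holds data for `job`.
    fn record_location(&mut self, job: &str, executor: &str) {
        self.locations
            .entry(job.to_string())
            .or_default()
            .insert(executor.to_string());
    }

    /// Queue a job for deletion instead of spawning a per-job delayed task.
    fn schedule(&mut self, job: &str) {
        self.pending.insert(job.to_string());
    }

    /// Called once per tick: drains the queue and returns one batch of job IDs
    /// per executor that actually holds data (targeted, no broadcast).
    fn drain_batches(&mut self) -> BTreeMap<ExecutorId, Vec<JobId>> {
        let mut batches: BTreeMap<ExecutorId, Vec<JobId>> = BTreeMap::new();
        for job in std::mem::take(&mut self.pending) {
            // Jobs with no recorded location produce no RPC at all.
            if let Some(executors) = self.locations.remove(&job) {
                for executor in executors {
                    batches.entry(executor).or_default().push(job.clone());
                }
            }
        }
        batches
    }
}
```

With this shape, the number of RPCs per tick is bounded by the number of executors holding data rather than by the number of finished jobs, and the grouping logic can be unit-tested without a running cluster.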
**Describe alternatives you've considered**
**Additional context**
Related: #1219, #1314
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]