[I] Refactor and improve job data cleanup logic [datafusion-ballista]

via GitHub Thu, 11 Sep 2025 03:37:14 -0700


KR-bluejay opened a new issue, #1316:
URL: https://github.com/apache/datafusion-ballista/issues/1316


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   The current job-data deletion flow has three problems:
   
   1. *Duplication*  
      The executor has `clean_all_shuffle_data` alongside other ad-hoc removal 
logic. These overlap in functionality, making the code harder to maintain and 
reason about.
   
   2. *Push-based broadcast*
      When the scheduler initiates cleanup, it currently notifies all 
executors. This is inefficient because only a subset of executors actually hold 
the job’s data.
   
   3. *Per-job deletion tasks* 
      In `clean_up_successful_job` / `clean_up_failed_job`, the scheduler 
spawns a separate delayed task (`sleep`) for each job and calls 
`state.remove_job(job_id)` individually. This results in many small tasks and 
RPCs, which could be batched more efficiently.
   
   **Describe the solution you'd like**
   Unify cleanup behind a single, testable “deletion facility”:
   
   1. *Deduplicate* logic with `clean_all_shuffle_data`; extract/keep a shared 
async remover (e.g., `remove_job_dir`) with safety checks.
   2. *Targeted push*: notify only executors that actually hold the job’s data 
(no broadcast).
   3. *Batching*: we already dispatch periodically; change each tick to send 
one batched `remove_jobs(Vec<JobId>)` for all pending IDs rather than spawning 
per-job sleeps and individual removals.
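
The targeted-push and batching ideas above can be sketched as follows. This is a minimal illustration only, not Ballista's actual code: `PendingCleanup`, `enqueue`, `drain_batches`, and the `JobId`/`ExecutorId` string aliases are hypothetical names, and it assumes the scheduler already knows which executors hold each job's data. Finished jobs are queued rather than given their own delayed task, and each periodic tick drains the queue into one batch per executor, so executors with no data for those jobs are never contacted.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical aliases for illustration; Ballista's real ID types may differ.
type JobId = String;
type ExecutorId = String;

/// Jobs awaiting cleanup, each mapped to the executors assumed to hold its data.
#[derive(Default)]
struct PendingCleanup {
    jobs: BTreeMap<JobId, BTreeSet<ExecutorId>>,
}

impl PendingCleanup {
    /// Queue a finished job instead of spawning a delayed per-job task.
    fn enqueue<I>(&mut self, job_id: &str, holders: I)
    where
        I: IntoIterator<Item = ExecutorId>,
    {
        self.jobs
            .entry(job_id.to_string())
            .or_default()
            .extend(holders);
    }

    /// Intended to be called once per tick: empties the queue and returns
    /// one batch of job IDs per executor that holds data for those jobs.
    fn drain_batches(&mut self) -> BTreeMap<ExecutorId, Vec<JobId>> {
        let mut batches: BTreeMap<ExecutorId, Vec<JobId>> = BTreeMap::new();
        for (job_id, holders) in std::mem::take(&mut self.jobs) {
            for executor_id in holders {
                batches.entry(executor_id).or_default().push(job_id.clone());
            }
        }
        batches
    }
}
```

Each entry in the returned map would correspond to a single batched `remove_jobs` call to that executor; how that call is actually sent is left out of the sketch.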
   
   **Describe alternatives you've considered**
   
   
   **Additional context**
   Related: #1219, #1314
   



