FLINK-38698: adaptive scheduler BLOB leak, proposed fix, requesting assignment

Spoorthi Basu Sun, 07 Jun 2026 22:38:56 -0700

Hi all,

I'd like to pick up FLINK-38698, a critical bug where offloaded
TaskInformation BLOBs accumulate without cleanup and can exhaust the BLOB
store on long-running jobs.

Root cause: The adaptive scheduler rebuilds the ExecutionGraph on every
restart or rescale, and each rebuild re-offloads deployment metadata to the
BLOB store under fresh keys. The superseded graph's BLOBs are only removed
at global job termination, so on a long-running job that restarts or
rescales repeatedly they accumulate for the lifetime of the job. The
default scheduler is not affected, since it reuses the same ExecutionGraph.
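
To make the accumulation concrete, here is a toy model of the pattern. It
is only a sketch and not Flink code: ToyBlobStore, blobsAfterRebuilds, and
the "one BLOB per vertex" assumption are all hypothetical stand-ins chosen
for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: every name here is hypothetical, not Flink's real API.
class BlobLeakSketch {

    /** Stand-in for the BLOB store: maps a key to an offloaded payload. */
    static final class ToyBlobStore {
        private final Map<String, byte[]> blobs = new HashMap<>();
        private int nextId = 0;

        /** Stores a payload under a fresh key, as each re-offload does. */
        String put(byte[] payload) {
            String key = "blob-" + nextId++;
            blobs.put(key, payload);
            return key;
        }

        int size() {
            return blobs.size();
        }
    }

    /**
     * Simulates building the graph `rebuilds` times, offloading one BLOB
     * per vertex each time (assumed granularity), with no cleanup of the
     * superseded graph's BLOBs. Returns the number of BLOBs left in the store.
     */
    static int blobsAfterRebuilds(int rebuilds, int verticesPerGraph) {
        ToyBlobStore store = new ToyBlobStore();
        for (int r = 0; r < rebuilds; r++) {
            for (int v = 0; v < verticesPerGraph; v++) {
                store.put(new byte[16]);
            }
            // The previous graph is dropped here, but nothing deletes
            // the BLOBs it offloaded, so they stay in the store.
        }
        return store.size();
    }

    public static void main(String[] args) {
        System.out.println(blobsAfterRebuilds(1, 3));  // 3
        System.out.println(blobsAfterRebuilds(10, 3)); // 30
    }
}
```

In this model the store grows linearly with the number of rebuilds, while
only the BLOBs of the most recent graph are still needed.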

Proposed fix: Release a graph's offloaded deployment BLOBs at the points
where the adaptive scheduler discards that graph, reusing the existing
BLOB-deletion path on DefaultExecutionGraph. The release is scoped to
graphs that are actually being discarded, so it never touches BLOBs a live
graph still needs. No state, wire-format, or public API changes. I have
implemented this locally and validated it with tests covering the discard
cases.
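
The shape of the fix can be sketched with a toy model. This is not the
actual patch and does not use Flink's real classes: ToyBlobStore, ToyGraph,
and releaseOffloadedBlobs are hypothetical names, and the sketch assumes
each graph records the keys it offloaded so that a discard deletes only
those.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: every name here is hypothetical, not Flink's real API.
class BlobReleaseSketch {

    /** Stand-in for the BLOB store; tracks only which keys exist. */
    static final class ToyBlobStore {
        private final Set<String> keys = new HashSet<>();
        private int nextId = 0;

        String put() {
            String key = "blob-" + nextId++;
            keys.add(key);
            return key;
        }

        void delete(String key) {
            keys.remove(key);
        }

        boolean contains(String key) {
            return keys.contains(key);
        }

        int size() {
            return keys.size();
        }
    }

    /** Stand-in for a graph that remembers the BLOB keys it offloaded. */
    static final class ToyGraph {
        private final ToyBlobStore store;
        private final List<String> offloadedKeys = new ArrayList<>();

        ToyGraph(ToyBlobStore store, int vertices) {
            this.store = store;
            for (int v = 0; v < vertices; v++) {
                offloadedKeys.add(store.put());
            }
        }

        List<String> offloadedKeys() {
            return new ArrayList<>(offloadedKeys);
        }

        /** Deletes only this graph's own BLOBs; other graphs are untouched. */
        void releaseOffloadedBlobs() {
            for (String key : offloadedKeys) {
                store.delete(key);
            }
            offloadedKeys.clear();
        }
    }

    /** Builds an initial graph, then replaces it `restarts` times. */
    static int liveBlobsAfterRestarts(int restarts, int vertices) {
        ToyBlobStore store = new ToyBlobStore();
        ToyGraph current = new ToyGraph(store, vertices);
        for (int r = 0; r < restarts; r++) {
            ToyGraph next = new ToyGraph(store, vertices);
            current.releaseOffloadedBlobs(); // release at the discard point
            current = next;
        }
        return store.size();
    }

    public static void main(String[] args) {
        System.out.println(liveBlobsAfterRestarts(10, 3)); // 3
    }
}
```

Because each graph releases only the keys it recorded itself, the store in
this model stays at one graph's worth of BLOBs no matter how many restarts
occur.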

I posted the same proposal on the JIRA ticket on June 2. Could a committer
please review the approach and, if it looks reasonable, assign the ticket
to me so I can open a PR?

Thanks,
Spoorthi Basu
