FrankChen021 opened a new issue, #19163: URL: https://github.com/apache/druid/issues/19163
### Background

Druid supports `druid.processing.intermediaryData.storage.type=deepstore` for native parallel indexing, introduced to address reliability issues such as rolling updates or worker loss during shuffle. In this mode, phase-1 tasks write shuffle/intermediary data under the `shuffle-data` prefix in deep storage.

### Problem

These deep-storage shuffle artifacts are not cleaned up after the parallel indexing task completes. The local intermediary-data path has cleanup behavior, but the deepstore path does not appear to have an equivalent end-to-end cleanup flow. In particular, cleanup cannot rely on the producing worker/task process still being alive, since the task may exit before deletion is triggered. As a result, completed parallel indexing tasks can leave behind many residual files and directories under `shuffle-data`, which accumulate over time in deep storage.

**Even the documentation calls out that automatic cleanup is not implemented and requires a retention policy configured on the deep-storage side. I don't think we should leave that responsibility to deep storage, and for HDFS deep storage there is no such built-in retention policy.**

<img width="1024" height="451" alt="Image" src="https://github.com/user-attachments/assets/715a2e73-71a4-4997-a2bc-ca3460fc72e6" />

### Proposal

Trigger deepstore intermediary-data cleanup from the supervisor task after the parallel indexing flow completes. This would make cleanup deterministic and consistent with the lifecycle of the overall parallel indexing task.
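As a rough illustration of what supervisor-driven cleanup could look like, here is a minimal, self-contained Java sketch that deletes one task's `shuffle-data` prefix after the supervisor observes completion. A local filesystem path stands in for deep storage, and the `ShuffleDataCleaner` class and its `cleanup` method are hypothetical names for this sketch, not part of Druid's actual API; a real implementation would go through the configured deep-storage client instead of `java.nio.file`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical sketch: delete the shuffle-data prefix for a finished
// supervisor task. In real Druid this would use the deep-storage
// connector, not local file APIs.
public class ShuffleDataCleaner
{
  private final Path deepStorageRoot;

  public ShuffleDataCleaner(Path deepStorageRoot)
  {
    this.deepStorageRoot = deepStorageRoot;
  }

  /**
   * Remove everything under shuffle-data/<supervisorTaskId>.
   * A missing prefix is treated as a no-op, so the call is safe to
   * retry and safe to run even if workers already cleaned up.
   */
  public void cleanup(String supervisorTaskId) throws IOException
  {
    Path prefix = deepStorageRoot.resolve("shuffle-data").resolve(supervisorTaskId);
    if (!Files.exists(prefix)) {
      return;
    }
    try (Stream<Path> walk = Files.walk(prefix)) {
      // Reverse order deletes children before their parent directories.
      walk.sorted(Comparator.reverseOrder())
          .forEach(p -> {
            try {
              Files.delete(p);
            }
            catch (IOException e) {
              throw new UncheckedIOException(e);
            }
          });
    }
  }
}
```

The key design point, matching the proposal above, is that the deletion runs in the supervisor task rather than in the phase-1 worker tasks, so cleanup does not depend on the producing worker process still being alive when the overall job finishes.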
