FrankChen021 opened a new issue, #19163:
URL: https://github.com/apache/druid/issues/19163

   ### Background
   
   Druid supports `druid.processing.intermediaryData.storage.type=deepstore` 
for native parallel indexing, introduced to address reliability issues such as 
rolling updates / worker loss during shuffle.
   
   In this mode, phase-1 tasks write shuffle/intermediary data under the 
`shuffle-data` prefix in deep storage.
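For reference, this mode is enabled with the following MiddleManager/Indexer property (the `deepstore` value is the one named above; the comment line is only illustrative):

```properties
# MiddleManager / Indexer runtime.properties
druid.processing.intermediaryData.storage.type=deepstore
```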
   
   ### Problem
   
   These deep-storage shuffle artifacts are not cleaned up after the parallel 
indexing task completes.
   
   The current local intermediary data path has cleanup behavior, but the 
deepstore path does not appear to have an equivalent end-to-end cleanup flow. 
In particular, cleanup cannot rely on the producing worker/task process still 
being alive, since the task may exit before deletion is triggered.
   
   As a result, completed parallel indexing tasks can leave behind many 
residual files/directories under `shuffle-data`, which accumulate over time in 
deep storage.
   
**Even the documentation calls out that automatic cleanup is not implemented and that a retention policy must be configured on the deep-storage side. I don't think this responsibility should be left to deep storage, and for HDFS deep storage there is no such built-in retention/auto-cleanup policy at all.**
   
<img width="1024" height="451" alt="Image" src="https://github.com/user-attachments/assets/715a2e73-71a4-4997-a2bc-ca3460fc72e6" />
   
   ### Proposal
   
   Trigger deepstore intermediary-data cleanup from the supervisor task after 
the parallel indexing flow completes.
   
   This would make cleanup deterministic and consistent with the lifecycle of 
the overall parallel indexing task.
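A minimal sketch of what such a supervisor-driven cleanup could look like. This is purely illustrative: the class, method names, and the `shuffle-data/<supervisorTaskId>/...` layout are assumptions, and the deletion is simulated against the local filesystem, whereas in Druid it would go through the configured deep-storage client. The key property is that the cleanup keys off the supervisor task id alone, so it does not depend on the producing worker/task process still being alive.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical sketch: after the parallel indexing flow completes, the
// supervisor task deletes everything under shuffle-data/<supervisorTaskId>/.
public class ShuffleDataCleanup {

    // Remove the whole prefix for one supervisor task, regardless of which
    // worker wrote the files or whether that worker is still alive.
    static void cleanupSupervisorTask(Path deepStorageRoot, String supervisorTaskId)
            throws IOException {
        Path prefix = deepStorageRoot.resolve("shuffle-data").resolve(supervisorTaskId);
        if (!Files.exists(prefix)) {
            return; // nothing to clean up
        }
        try (Stream<Path> paths = Files.walk(prefix)) {
            // Delete children before parents.
            paths.sorted(Comparator.reverseOrder())
                 .forEach(p -> {
                     try {
                         Files.delete(p);
                     } catch (IOException e) {
                         throw new RuntimeException(e);
                     }
                 });
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a leftover phase-1 shuffle partition under deep storage.
        Path root = Files.createTempDirectory("deepstore");
        Path partition = root.resolve("shuffle-data/task-abc/interval-0/partition-0");
        Files.createDirectories(partition);
        Files.writeString(partition.resolve("part-0.zip"), "data");

        cleanupSupervisorTask(root, "task-abc");

        // Prefix is gone after the supervisor-triggered cleanup.
        System.out.println(Files.exists(root.resolve("shuffle-data/task-abc")));
    }
}
```

Running the supervisor-side cleanup once at the end of the task (rather than per worker) keeps the deletion deterministic even when individual workers exit or are replaced mid-shuffle.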
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
