GWphua commented on code in PR #19187:
URL: https://github.com/apache/druid/pull/19187#discussion_r2973554284
##########
indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/ParallelIndexSupervisorTask.java:
##########
@@ -957,6 +964,62 @@ TaskStatus runRangePartitionMultiPhaseParallel(TaskToolbox toolbox) throws Excep
return taskStatus;
}
+  /**
+   * Cleans up deep storage shuffle data produced during phase 1 of multi-phase parallel indexing.
+   * <p>
+   * Cleanup is performed here in the supervisor task rather than in
+   * {@link org.apache.druid.indexing.worker.shuffle.DeepStorageIntermediaryDataManager} because of
+   * the process model: phase-1 sub-tasks run as separate peon processes that exit before phase 2
+   * starts. Each sub-task's DeepStorageIntermediaryDataManager instance is destroyed when the peon
+   * exits, so no surviving manager instance has knowledge of what files were pushed. The supervisor
+   * task is the only entity that is both alive after phase 2 completes and has the complete set of
+   * loadSpecs (collected from all sub-task reports).
+   * <p>
+   * This method constructs minimal {@link DataSegment} objects from {@link DeepStoragePartitionStat} loadSpecs
+   * and delegates deletion to the appropriate storage-specific {@link DataSegmentKiller}.
+   *
+   * @param killer  the segment killer from {@link TaskToolbox#getDataSegmentKiller()}.
+   * @param reports phase-1 sub-task reports containing partition stats with loadSpecs,
+   *                may be null or empty if phase 1 produced no output.
+   */
+  @VisibleForTesting
+  static void cleanupDeepStorageShuffleData(
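The cleanup flow the Javadoc describes can be sketched as below. This is a minimal, self-contained model, not the PR's actual code: `DataSegment`, `PartitionStat`, and `DataSegmentKiller` here are stand-ins for Druid's much richer classes, and the method signature is simplified for illustration.

```java
import java.util.List;
import java.util.Map;

public class ShuffleCleanupSketch
{
  // Minimal stand-ins for Druid's DataSegment and DeepStoragePartitionStat;
  // the real classes carry many more fields.
  record DataSegment(Map<String, Object> loadSpec) {}

  record PartitionStat(Map<String, Object> loadSpec) {}

  interface DataSegmentKiller
  {
    void kill(DataSegment segment) throws Exception;
  }

  // Supervisor-side cleanup: wrap each loadSpec in a minimal segment and
  // delegate deletion to the storage-specific killer. A null or empty list
  // means phase 1 produced no output, so there is nothing to delete.
  static int cleanupDeepStorageShuffleData(DataSegmentKiller killer, List<PartitionStat> stats)
      throws Exception
  {
    if (stats == null || stats.isEmpty()) {
      return 0;
    }
    int killed = 0;
    for (PartitionStat stat : stats) {
      killer.kill(new DataSegment(stat.loadSpec()));
      killed++;
    }
    return killed;
  }
}
```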
Review Comment:
I did a test on HDFS. Here are some findings:
1. The HDFS shuffle data is stored in a location like the following:
```
segments/shuffle-data/<taskId>/<startInterval>/<endInterval>/<partition>/<PartialIndexGeneratorTaskId>/0_index.zip
```
2. The current implementation of `kill` in `HdfsDataSegmentKiller` hard-codes a removal depth of 2 to 3 directory levels, meaning we will still see **EMPTY** directories like the following after all segments are killed:
```
-- Before --
segments/shuffle-data/<taskId>/<startInterval>/<endInterval>/<partition>/<PartialIndexGeneratorTaskId>/0_index.zip
-- After -- (Either can happen)
segments/shuffle-data/<taskId>/<startInterval> (Deleted 3 layers up)
segments/shuffle-data/<taskId>/<startInterval>/<endInterval> (Deleted 2 layers up)
```
3. I will add a new method to the `DataSegmentKiller` interface solely for handling intermediary data; implementations will default to delegating to the existing `kill` method. Users of other storage backends are welcome to provide their own overrides.
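The proposed interface change in point 3 could look roughly like the sketch below. This is an assumption about shape only: the method name `killIntermediary` and the stand-in `DataSegment` record are hypothetical, and the real `DataSegmentKiller` methods throw Druid-specific exceptions.

```java
import java.util.ArrayList;
import java.util.List;

public class KillerDefaultSketch
{
  record DataSegment(String path) {}  // stand-in for Druid's DataSegment

  interface DataSegmentKiller
  {
    void kill(DataSegment segment) throws Exception;

    // Proposed addition (hypothetical name): a hook for intermediary
    // (shuffle) data. The default simply delegates to kill();
    // HdfsDataSegmentKiller could override it to also remove the empty
    // parent directories that kill() leaves behind.
    default void killIntermediary(DataSegment segment) throws Exception
    {
      kill(segment);
    }
  }

  // Demonstrates that a killer implementing only kill() still handles
  // intermediary data via the default method.
  static List<String> demo() throws Exception
  {
    List<String> deleted = new ArrayList<>();
    DataSegmentKiller killer = segment -> deleted.add(segment.path());
    killer.killIntermediary(new DataSegment("segments/shuffle-data/task1/0_index.zip"));
    return deleted;
  }
}
```

A default method keeps every existing `DataSegmentKiller` implementation source- and binary-compatible while letting HDFS opt into deeper directory cleanup.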
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]