GWphua commented on code in PR #19187:
URL: https://github.com/apache/druid/pull/19187#discussion_r2972497357
##########
indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/ParallelIndexSupervisorTask.java:
##########
@@ -957,6 +964,62 @@ TaskStatus runRangePartitionMultiPhaseParallel(TaskToolbox toolbox) throws Excep
return taskStatus;
}
+ /**
+ * Cleans up deep storage shuffle data produced during phase 1 of multi-phase parallel indexing.
+ * <p>
+ * Cleanup is performed here in the supervisor task rather than in
+ * {@link org.apache.druid.indexing.worker.shuffle.DeepStorageIntermediaryDataManager} because of
+ * the process model: phase-1 sub-tasks run as separate peon processes that exit before phase 2
+ * starts. Each sub-task's DeepStorageIntermediaryDataManager instance is destroyed when the peon
+ * exits, so no surviving manager instance has knowledge of what files were pushed. The supervisor
+ * task is the only entity that is both alive after phase 2 completes and has the complete set of
+ * loadSpecs (collected from all sub-task reports).
+ * <p>
+ * This method constructs minimal {@link DataSegment} objects from {@link DeepStoragePartitionStat} loadSpecs and
+ * delegates deletion to the appropriate storage-specific {@link DataSegmentKiller}.
+ *
+ * @param killer the segment killer from {@link TaskToolbox#getDataSegmentKiller()}.
+ * @param reports phase-1 sub-task reports containing partition stats with loadSpecs,
+ * may be null or empty if phase 1 produced no output.
+ */
+ @VisibleForTesting
+ static void cleanupDeepStorageShuffleData(
Review Comment:
It depends on the `DataSegmentKiller` implementation.
From what I can see, the file storage implementations, such as
`LocalDataSegmentKiller`, `HdfsDataSegmentKiller` and `AzureDataSegmentKiller`,
remove the directory if it is empty.
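
To illustrate the "remove the dir if it is empty" behaviour, here is a minimal sketch that uses only `java.nio.file`. It is not Druid code: the class `ShuffleCleanupSketch`, the method `deleteFileAndEmptyParent`, and the file names are hypothetical, and the actual killers may differ in the details.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; not taken from any DataSegmentKiller implementation.
public class ShuffleCleanupSketch {

    /**
     * Deletes the given file, then removes its parent directory only if that
     * directory is now empty. Returns true if the parent directory was removed.
     */
    static boolean deleteFileAndEmptyParent(Path file) throws IOException {
        Files.deleteIfExists(file);
        Path parent = file.getParent();
        if (parent == null || !Files.isDirectory(parent)) {
            return false;
        }
        // Check for remaining entries without listing the whole directory.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(parent)) {
            if (entries.iterator().hasNext()) {
                return false; // other files remain, so keep the directory
            }
        }
        Files.delete(parent);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("shuffle-sketch");
        Path first = Files.createFile(dir.resolve("partition-0.zip"));
        Path second = Files.createFile(dir.resolve("partition-1.zip"));

        System.out.println(deleteFileAndEmptyParent(first));  // false: a sibling file remains
        System.out.println(deleteFileAndEmptyParent(second)); // true: directory is now empty
        System.out.println(Files.exists(dir));                // false
    }
}
```

With this pattern, deleting the last pushed file also removes its directory, so no separate directory cleanup step is needed for that storage type.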