FrankChen021 commented on code in PR #19187:
URL: https://github.com/apache/druid/pull/19187#discussion_r2982057149
##########
docs/configuration/index.md:
##########
@@ -1376,7 +1376,7 @@ Processing properties set on the Middle Manager are
passed through to Peons.
|`druid.processing.numTimeoutThreads`|The number of processing threads to have
available for handling per-segment query timeouts. Setting this value to `0`
removes the ability to service per-segment timeouts, irrespective of
`perSegmentTimeout` query context parameter. As these threads are just
servicing timers, it's recommended to set this value to some small percent
(e.g. 5%) of the total query processing cores available to the peon.|0|
|`druid.processing.fifo`|Enables the processing queue to treat tasks of equal
priority in a FIFO manner.|`true`|
|`druid.processing.tmpDir`|Path where temporary files created while processing
a query should be stored. If specified, this configuration takes priority over
the default `java.io.tmpdir` path.|path represented by `java.io.tmpdir`|
-|`druid.processing.intermediaryData.storage.type`|Storage type for
intermediary segments of data shuffle between native parallel index tasks. <br
/>Set to `local` to store segment files in the local storage of the Middle
Manager or Indexer. <br />Set to `deepstore` to use configured deep storage for
better fault tolerance during rolling updates. When the storage type is
`deepstore`, Druid stores the data in the `shuffle-data` directory under the
configured deep storage path. Druid does not support automated cleanup for the
`shuffle-data` directory. You can set up cloud storage lifecycle rules for
automated cleanup of data at the `shuffle-data` prefix location.|`local`|
+|`druid.processing.intermediaryData.storage.type`|Storage type for
intermediary segments of data shuffle between native parallel index tasks. <br
/>Set to `local` to store segment files in the local storage of the Middle
Manager or Indexer. <br />Set to `deepstore` to use configured deep storage for
better fault tolerance during rolling updates. When the storage type is
`deepstore`, Druid stores the data in the `shuffle-data` directory under the
configured deep storage path. Druid automatically cleans up shuffle data from
deep storage when the parallel indexing task completes.|`local`|
Review Comment:
I think the description is not correct. We only support automatic cleanup of
shuffle files for HDFS deep storage.
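
For reference, the property under discussion is set in the Middle Manager or
Indexer runtime properties; a minimal fragment using the documented values
might look like:

```properties
# Store shuffle intermediates in deep storage (under the shuffle-data directory)
# instead of local disk; the documented default is `local`.
druid.processing.intermediaryData.storage.type=deepstore
```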
##########
processing/src/main/java/org/apache/druid/segment/loading/DataSegmentKiller.java:
##########
@@ -98,4 +103,22 @@ default void killQuietly(DataSegment segment)
* is only implemented by local and HDFS.
*/
void killAll() throws IOException;
+
+ /**
+ * Best-effort removal of all deep-storage shuffle intermediates for a
native parallel index supervisor task.
+ * Native parallel indexing writes shuffle files only under {@code
shuffle-data/<supervisorTaskId>/} (see
+ * {@link #SHUFFLE_DATA_DIR_NAME} and {@code
org.apache.druid.indexing.worker.shuffle.DeepStorageIntermediaryDataManager});
+ * the default implementation is a no-op.
+ * <p>
+ * Object stores (S3, GCS, Azure Blob, etc.): there is usually no
recursive-delete primitive; implementors
+ * should list objects under the key prefix {@code
shuffle-data/<supervisorTaskId>/} (or the extension's equivalent
+ * layout under the configured bucket/prefix), delete in pages, and tolerate
missing keys (idempotent cleanup). Use
+ * batch delete APIs where available. Be careful with listing consistency:
eventual consistency and pagination
+ * boundaries may require retries or a second list pass. Never delete keys
outside that prefix (other supervisors
+ * share {@code shuffle-data/}). If the supervisor JVM dies before {@code
cleanUp}, operators can remove the same
+ * prefix manually.
+ */
+ default void killShuffleSupervisorPrefix(String supervisorTaskId) throws
SegmentLoadingException
Review Comment:
Why does it throw `SegmentLoadingException`? This method is not loading
segments.
##########
processing/src/main/java/org/apache/druid/segment/loading/DataSegmentKiller.java:
##########
@@ -98,4 +103,22 @@ default void killQuietly(DataSegment segment)
* is only implemented by local and HDFS.
*/
void killAll() throws IOException;
+
+ /**
+ * Best-effort removal of all deep-storage shuffle intermediates for a
native parallel index supervisor task.
+ * Native parallel indexing writes shuffle files only under {@code
shuffle-data/<supervisorTaskId>/} (see
+ * {@link #SHUFFLE_DATA_DIR_NAME} and {@code
org.apache.druid.indexing.worker.shuffle.DeepStorageIntermediaryDataManager});
+ * the default implementation is a no-op.
+ * <p>
+ * Object stores (S3, GCS, Azure Blob, etc.): there is usually no
recursive-delete primitive; implementors
+ * should list objects under the key prefix {@code
shuffle-data/<supervisorTaskId>/} (or the extension's equivalent
+ * layout under the configured bucket/prefix), delete in pages, and tolerate
missing keys (idempotent cleanup). Use
+ * batch delete APIs where available. Be careful with listing consistency:
eventual consistency and pagination
+ * boundaries may require retries or a second list pass. Never delete keys
outside that prefix (other supervisors
+ * share {@code shuffle-data/}). If the supervisor JVM dies before {@code
cleanUp}, operators can remove the same
+ * prefix manually.
+ */
+ default void killShuffleSupervisorPrefix(String supervisorTaskId) throws
SegmentLoadingException
Review Comment:
The abstraction (name) is not good. The segment killer should not be aware of
concepts like supervisor or shuffle; it should only be aware of a dir/folder.
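
The paged, idempotent prefix deletion that the quoted javadoc asks
implementors to follow could be sketched as below. This is an illustrative
stand-alone sketch, not Druid code: `ObjectStore`, `InMemoryStore`, and
`PrefixKiller` are hypothetical names, and the in-memory store stands in for a
real cloud SDK's list/delete calls.

```java
import java.util.*;

// Hypothetical stand-in for an object store client: paged listing under a
// key prefix, plus idempotent batch delete (missing keys are ignored).
interface ObjectStore {
    // Returns up to pageSize keys under prefix that sort after afterKey (null = start).
    List<String> listPage(String prefix, String afterKey, int pageSize);
    // Deletes the given keys; keys that no longer exist are silently skipped.
    void deleteBatch(Collection<String> keys);
}

// Simple in-memory implementation so the sketch is runnable without a cloud SDK.
final class InMemoryStore implements ObjectStore {
    private final TreeSet<String> keys = new TreeSet<>();

    void put(String key) { keys.add(key); }
    boolean contains(String key) { return keys.contains(key); }

    public List<String> listPage(String prefix, String afterKey, int pageSize) {
        List<String> page = new ArrayList<>();
        String start = (afterKey == null) ? prefix : afterKey;
        for (String k : keys.tailSet(start, afterKey == null)) {
            if (!k.startsWith(prefix)) break;   // never touch keys outside the prefix
            page.add(k);
            if (page.size() == pageSize) break;
        }
        return page;
    }

    public void deleteBatch(Collection<String> toDelete) { keys.removeAll(toDelete); }
}

final class PrefixKiller {
    // Deletes everything under dirPrefix page by page; returns the count removed.
    static int killAllUnderPrefix(ObjectStore store, String dirPrefix, int pageSize) {
        int deleted = 0;
        String marker = null;
        while (true) {
            List<String> page = store.listPage(dirPrefix, marker, pageSize);
            if (page.isEmpty()) return deleted;
            store.deleteBatch(page);             // use batch delete where available
            deleted += page.size();
            marker = page.get(page.size() - 1);  // resume listing after the last key seen
        }
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        store.put("shuffle-data/taskA/part-0");
        store.put("shuffle-data/taskA/part-1");
        store.put("shuffle-data/taskB/part-0"); // another task's data must survive
        int n = killAllUnderPrefix(store, "shuffle-data/taskA/", 1);
        System.out.println(n + " " + store.contains("shuffle-data/taskB/part-0")); // prints "2 true"
    }
}
```

Note that the deletion loop only knows about a directory-like key prefix, in
line with the review comment: the caller, not the killer, decides that the
prefix happens to be a supervisor task's shuffle directory.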
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]