[GitHub] [druid] cryptoe commented on a diff in pull request #14035: msq: add durable storage info

via GitHub Fri, 07 Apr 2023 09:46:36 -0700


cryptoe commented on code in PR #14035:
URL: https://github.com/apache/druid/pull/14035#discussion_r1160824147



##########
docs/multi-stage-query/reference.md:
##########
@@ -696,19 +696,55 @@ CLUSTERED BY user
 
 ## Durable Storage
 
-This section enumerates the advantages and performance implications of 
enabling durable storage while executing MSQ tasks.
+Using durable storage with your SQL-based ingestions can improve their 
reliability by writing intermediate files to a storage location temporarily. 
 
 To prevent durable storage from getting filled up with temporary files in case 
the tasks fail to clean them up, a periodic
 cleaner can be scheduled to clean the directories corresponding to which there 
isn't a controller task running. It utilizes
 the storage connector to work upon the durable storage. The durable storage 
location should only be utilized to store the output
 for cluster's MSQ tasks. If the location contains other files or directories, 
then they will get cleaned up as well.
-Following table lists the properties that can be set to control the behavior 
of the durable storage of the cluster.
+
+### Enable durable storage
+
+To enable durable storage, you need to set the following common service 
properties:
+
+```
+druid.msq.intermediate.storage.enable=true
+druid.msq.intermediate.storage.type=s3
+druid.msq.intermediate.storage.bucket=YOUR_BUCKET
+druid.msq.intermediate.storage.prefix=YOUR_PREFIX
+druid.msq.intermediate.storage.tempDir=/path/to/your/temp/dir
+```
+
+For information about these settings and others related to durable storage, 
see [Durable storage configurations](#durable-storage-configurations).

Review Comment:
   ```suggestion
   For detailed information about the settings related to durable storage, see 
[Durable storage configurations](#durable-storage-configurations).
   ```



##########
docs/multi-stage-query/reference.md:
##########
@@ -696,19 +696,55 @@ CLUSTERED BY user
 
 ## Durable Storage
 
-This section enumerates the advantages and performance implications of 
enabling durable storage while executing MSQ tasks.
+Using durable storage with your SQL-based ingestions can improve their 
reliability by writing intermediate files to a storage location temporarily. 
 
 To prevent durable storage from getting filled up with temporary files in case 
the tasks fail to clean them up, a periodic
 cleaner can be scheduled to clean the directories corresponding to which there 
isn't a controller task running. It utilizes
 the storage connector to work upon the durable storage. The durable storage 
location should only be utilized to store the output
 for cluster's MSQ tasks. If the location contains other files or directories, 
then they will get cleaned up as well.
-Following table lists the properties that can be set to control the behavior 
of the durable storage of the cluster.
+
+### Enable durable storage
+
+To enable durable storage, you need to set the following common service 
properties:
+
+```
+druid.msq.intermediate.storage.enable=true
+druid.msq.intermediate.storage.type=s3
+druid.msq.intermediate.storage.bucket=YOUR_BUCKET
+druid.msq.intermediate.storage.prefix=YOUR_PREFIX
+druid.msq.intermediate.storage.tempDir=/path/to/your/temp/dir
+```
+
+For information about these settings and others related to durable storage, 
see [Durable storage configurations](#durable-storage-configurations).
+
+
+### Use durable storage for queries
+
+When you run a query,  include the context parameter `durableShuffleStorage` 
and set it to `true`. 
+
+For queries where you want to use fault tolerance for workers,  set 
`faultTolerance` to `true`, which automatically sets `durableShuffleStorage` to 
`true`.
+
+## Durable storage configurations
+
+The following common service properties control how durable storage behaves:
+
+|Parameter          |Default                                 | Description     
     |
+|-------------------|----------------------------------------|----------------------|
+|`druid.msq.intermediate.storage.bucket` | n/a | The bucket in S3 where you 
want to store intermediate files.  |
+| `druid.msq.intermediate.storage.chunkSize` | n/a | Optional. Defines the 
size of each chunk to temporarily store in 
`druid.msq.intermediate.storage.tempDir`. The chunk size must be between 5 MiB 
and 5 GiB. Druid computes the chunk size automatically if no value is 
provided.| 
+|`druid.msq.intermediate.storage.enable` | true | Required. Whether to enable 
durable storage for the cluster.|
+| `druid.msq.intermediate.storage.maxTriesOnTransientErrors` | 10 | Optional. 
Defines the max number times to attempt S3 API calls to avoid failures due to 
transient errors. | 
+|`druid.msq.intermediate.storage.type` | `s3` if your deep storage is S3 | 
Required. The type of storage to use. You can either set this to `local` or 
`s3`.  |
+|`druid.msq.intermediate.storage.prefix` | n/a | S3 prefix to store 
intermediate stage results. Provide a unique value for the prefix. Don't share 
the same prefix between clusters. If the location  includes other files or 
directories, then they will get cleaned up as well.  |
+| `druid.msq.intermediate.storage.tempDir`|  | Required. Directory path on the 
local disk to temporarily store intermediate stage results.  |
+
+In addition to the common service properties, there are certain properties 
that you configure on the Overlord specifically:

Review Comment:
   ```suggestion
   In addition to the common service properties, there are certain properties 
that you configure on the Overlord specifically to clean up intermediate files:
   ```



##########
docs/multi-stage-query/security.md:
##########
@@ -50,3 +50,17 @@ To interact with a query through the Overlord API, users 
need the following perm
 
 - `INSERT` or `REPLACE` queries: Users must have READ DATASOURCE permission on 
the output datasource.
 - `SELECT` queries: Users must have read permissions on the `__query_select` 
datasource, which is a stub datasource that gets created.
+
+## S3
+
+The MSQ task engine can use S3 to store intermediate files when running 
queries. This can increase its reliability but requires certain permissions in 
S3.

Review Comment:
   ```suggestion
   The MSQ task engine can use S3 to store intermediate files when running 
queries. This can increase its reliability but requires certain permissions in 
S3.
   These permissions are required if you configure durable storage. 
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] cryptoe commented on a diff in pull request #14035: msq: add durable storage info

Reply via email to