This is an automated email from the ASF dual-hosted git repository.
karan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/master by this push:
new 67df1324ee Undocumenting certain context parameter in MSQ. (#13928)
67df1324ee is described below
commit 67df1324eeffa24ebec0844ef091c673a8653e6a
Author: Karan Kumar <[email protected]>
AuthorDate: Thu Mar 16 17:56:44 2023 +0530
Undocumenting certain context parameter in MSQ. (#13928)
* Removing intermediateSuperSorterStorageMaxLocalBytes,
maxInputBytesPerWorker, composedIntermediateSuperSorterStorageEnabled,
clusterStatisticsMergeMode from docs
* Adding documentation in the context class.
---
docs/multi-stage-query/reference.md | 24 ---------------
.../druid/msq/exec/ClusterStatisticsMergeMode.java | 18 ++++++++----
.../druid/msq/util/MultiStageQueryContext.java | 34 ++++++++++++++++++++--
3 files changed, 45 insertions(+), 31 deletions(-)
diff --git a/docs/multi-stage-query/reference.md
b/docs/multi-stage-query/reference.md
index 25eb827efc..32c38a60ab 100644
--- a/docs/multi-stage-query/reference.md
+++ b/docs/multi-stage-query/reference.md
@@ -598,12 +598,8 @@ The following table lists the context parameters for the
MSQ task engine:
| `maxParseExceptions`| SELECT, INSERT, REPLACE<br /><br />Maximum number of
parse exceptions that are ignored while executing the query before it stops
with `TooManyWarningsFault`. To ignore all the parse exceptions, set the value
to -1.| 0 |
| `rowsPerSegment` | INSERT or REPLACE<br /><br />The number of rows per
segment to target. The actual number of rows per segment may be somewhat higher
or lower than this number. In most cases, use the default. For general
information about sizing rows per segment, see [Segment Size
Optimization](../operations/segment-optimization.md). | 3,000,000 |
| `indexSpec` | INSERT or REPLACE<br /><br />An
[`indexSpec`](../ingestion/ingestion-spec.md#indexspec) to use when generating
segments. May be a JSON string or object. See [Front
coding](../ingestion/ingestion-spec.md#front-coding) for details on configuring
an `indexSpec` with front coding. | See
[`indexSpec`](../ingestion/ingestion-spec.md#indexspec). |
-| `clusterStatisticsMergeMode` | Whether to use parallel or sequential mode
for merging of the worker sketches. Can be `PARALLEL`, `SEQUENTIAL` or `AUTO`.
See [Sketch Merging Mode](#sketch-merging-mode) for more information. |
`PARALLEL` |
| `durableShuffleStorage` | SELECT, INSERT, REPLACE <br /><br />Whether to use
durable storage for shuffle mesh. To use this feature, configure the durable
storage at the server level using
`druid.msq.intermediate.storage.enable=true`). If these properties are not
configured, any query with the context variable `durableShuffleStorage=true`
fails with a configuration error. <br /><br /> | `false` |
| `faultTolerance` | SELECT, INSERT, REPLACE<br /><br /> Whether to turn on
fault tolerance mode or not. Failed workers are retried based on
[Limits](#limits). Cannot be used when `durableShuffleStorage` is explicitly
set to false. | `false` |
-| `composedIntermediateSuperSorterStorageEnabled` | SELECT, INSERT, REPLACE<br
/><br /> Whether to enable automatic fallback to durable storage from local
storage for sorting's intermediate data. Requires to setup
`intermediateSuperSorterStorageMaxLocalBytes` limit for local storage and
durable shuffle storage feature as well.| `false` |
-| `intermediateSuperSorterStorageMaxLocalBytes` | SELECT, INSERT, REPLACE<br
/><br /> Whether to enable a byte limit on local storage for sorting's
intermediate data. If that limit is crossed, the task fails with
`ResourceLimitExceededException`.| `9223372036854775807` |
-| `maxInputBytesPerWorker` | Should be used in conjunction with taskAssignment
`auto` mode. When dividing the input of a stage among the workers, this
parameter determines the maximum size in bytes that are given to a single
worker before the next worker is chosen. This parameter is only used as a
guideline during input slicing, and does not guarantee that a the input cannot
be larger. For example, we have 3 files. 3, 7, 12 GB each. then we would end up
using 2 worker: worker 1 -> 3, 7 a [...]
## Joins
@@ -691,26 +687,6 @@ PARTITIONED BY HOUR
CLUSTERED BY user
```
-## Sketch Merging Mode
-This section details the advantages and performance of various Cluster By
Statistics Merge Modes.
-
-If a query requires key statistics to generate partition boundaries, key
statistics are gathered by the workers while
-reading rows from the datasource. These statistics must be transferred to the
controller to be merged together.
-`clusterStatisticsMergeMode` configures the way in which this happens.
-
-`PARALLEL` mode fetches the key statistics for all time chunks from all
workers together and the controller then downsamples
-the sketch if it does not fit in memory. This is faster than `SEQUENTIAL` mode
as there is less over head in fetching sketches
-for all time chunks together. This is good for small sketches which won't be
down sampled even if merged together or if
-accuracy in segment sizing for the ingestion is not very important.
-
-`SEQUENTIAL` mode fetches the sketches in ascending order of time and
generates the partition boundaries for one time
-chunk at a time. This gives more working memory to the controller for merging
sketches, which results in less
-down sampling and thus, more accuracy. There is, however, a time overhead on
fetching sketches in sequential order. This is
-good for cases where accuracy is important.
-
-`AUTO` mode tries to find the best approach based on number of workers. If
there are more
-than 100 workers, `SEQUENTIAL` is chosen, otherwise, `PARALLEL` is chosen.
-
## Durable Storage
This section enumerates the advantages and performance implications of
enabling durable storage while executing MSQ tasks.
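The `maxInputBytesPerWorker` row removed above describes a greedy slicing guideline: input is handed to one worker until the byte limit is reached, then the next worker starts, and a single oversized input can still exceed the limit. A minimal sketch of that idea, with illustrative names (this is not Druid's actual slicing code):

```java
import java.util.ArrayList;
import java.util.List;

// Greedy assignment of input file sizes to workers, per the guideline in the
// removed doc row. A file larger than the limit still goes to one worker:
// the limit is a guideline during slicing, not a hard cap.
class InputSlicingSketch
{
  static List<List<Long>> slice(long[] fileSizes, long maxBytesPerWorker)
  {
    List<List<Long>> workers = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentBytes = 0;
    for (long size : fileSizes) {
      // Start a new worker once the running total would cross the limit.
      if (!current.isEmpty() && currentBytes + size > maxBytesPerWorker) {
        workers.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(size);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      workers.add(current);
    }
    return workers;
  }

  public static void main(String[] args)
  {
    long gb = 1L << 30;
    // Files of 3, 7, and 12 GB with the 10737418240-byte (10 GiB) default:
    // worker 1 gets [3, 7], worker 2 gets [12], matching the doc's example.
    System.out.println(InputSlicingSketch.slice(new long[]{3 * gb, 7 * gb, 12 * gb}, 10737418240L));
  }
}
```

This reproduces the "3, 7, 12 GB -> two workers" example from the doc row under the default limit.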
diff --git
a/extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/exec/ClusterStatisticsMergeMode.java
b/extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/exec/ClusterStatisticsMergeMode.java
index 92ed82ff5e..18e6b3bb8a 100644
---
a/extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/exec/ClusterStatisticsMergeMode.java
+++
b/extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/exec/ClusterStatisticsMergeMode.java
@@ -20,23 +20,31 @@
package org.apache.druid.msq.exec;
/**
- * Mode which dictates how {@link WorkerSketchFetcher} gets sketches for the
partition boundaries from workers.
+ * If a query requires key statistics to generate partition boundaries, key statistics are gathered by the workers
+ * while reading rows from the datasource. These statistics must be transferred to the controller to be merged
+ * together. The modes below dictate how {@link WorkerSketchFetcher} gets sketches for the partition boundaries
+ * from workers.
*/
public enum ClusterStatisticsMergeMode
{
/**
- * Fetches sketch in sequential order based on time. Slower due to overhead,
but more accurate.
+ * Fetches the sketches in ascending order of time and generates the partition boundaries for one time
+ * chunk at a time. This gives the controller more working memory for merging sketches, which results in less
+ * downsampling and thus more accuracy. There is, however, a time overhead in fetching sketches in sequential
+ * order. This mode is good for cases where accuracy is important.
*/
SEQUENTIAL,
/**
- * Fetch all sketches from the worker at once. Faster to generate
partitions, but less accurate.
+ * Fetches the key statistics for all time chunks from all workers together. The controller then downsamples
+ * the sketch if it does not fit in memory. This is faster than `SEQUENTIAL` mode as there is less overhead in
+ * fetching sketches for all time chunks together. This mode is good for small sketches which won't be downsampled
+ * even if merged together, or for cases where accuracy in segment sizing for the ingestion is not very important.
*/
PARALLEL,
/**
- * Tries to decide between sequential and parallel modes based on the number
of workers and size of the input
- *
+ * Tries to decide between sequential and parallel modes based on the number
of workers and size of the input.
+ * <p>
* If there are more than 100 workers or if the combined sketch size among
all workers is more than
* 1,000,000,000 bytes, SEQUENTIAL mode is chosen, otherwise, PARALLEL mode
is chosen.
*/
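The `AUTO` heuristic described in the Javadoc above can be sketched as a simple decision function. Thresholds are taken from the Javadoc; the class and method names here are illustrative, not Druid's:

```java
// AUTO per the Javadoc: more than 100 workers, or a combined sketch size
// above 1,000,000,000 bytes across workers, selects SEQUENTIAL; otherwise
// PARALLEL.
class MergeModeSketch
{
  enum Mode { SEQUENTIAL, PARALLEL }

  static final int WORKER_THRESHOLD = 100;
  static final long SKETCH_BYTES_THRESHOLD = 1_000_000_000L;

  static Mode resolveAuto(int numWorkers, long combinedSketchBytes)
  {
    if (numWorkers > WORKER_THRESHOLD || combinedSketchBytes > SKETCH_BYTES_THRESHOLD) {
      return Mode.SEQUENTIAL;
    }
    return Mode.PARALLEL;
  }

  public static void main(String[] args)
  {
    System.out.println(resolveAuto(150, 0L));            // SEQUENTIAL: too many workers
    System.out.println(resolveAuto(10, 2_000_000_000L)); // SEQUENTIAL: sketches too large
    System.out.println(resolveAuto(10, 1_000L));         // PARALLEL
  }
}
```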
diff --git
a/extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/util/MultiStageQueryContext.java
b/extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/util/MultiStageQueryContext.java
index b03c1111fe..1bb247587a 100644
---
a/extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/util/MultiStageQueryContext.java
+++
b/extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/util/MultiStageQueryContext.java
@@ -43,7 +43,35 @@ import java.util.stream.Collectors;
/**
* Class for all MSQ context params
- */
+ * <p>
+ * One of the design goals for MSQ is to have fewer tuning parameters. If a parameter is not expected to change the
+ * result of a job, but only how the job runs, it is a parameter we can skip documenting in external docs.
+ * <p>
+ * List of context parameters not present in external docs:
+ * <ol>
+ * <li><b>composedIntermediateSuperSorterStorageEnabled</b>: Whether to enable automatic fallback to durable storage
+ * from local storage for sorting's intermediate data. Requires the `intermediateSuperSorterStorageMaxLocalBytes`
+ * limit to be set for local storage, and durable shuffle storage to be configured as well. Default value is
+ * <b>false</b>.</li>
+ *
+ * <li><b>intermediateSuperSorterStorageMaxLocalBytes</b>: Byte limit on local storage for sorting's intermediate
+ * data. If that limit is crossed, the task fails with {@link org.apache.druid.query.ResourceLimitExceededException}.
+ * Default value is <b>9223372036854775807</b>.</li>
+ *
+ * <li><b>maxInputBytesPerWorker</b>: Should be used in conjunction with taskAssignment `auto` mode. When dividing the
+ * input of a stage among the workers, this parameter determines the maximum size in bytes that is given to a single
+ * worker before the next worker is chosen. This parameter is only used as a guideline during input slicing, and does
+ * not guarantee that the input cannot be larger. For example, given three files of 3, 7, and 12 GB, we would end up
+ * using two workers: worker 1 -> 3, 7 and worker 2 -> 12. This value is used for all stages in a query. Default
+ * value is <b>10737418240</b>.</li>
+ *
+ * <li><b>clusterStatisticsMergeMode</b>: Whether to use parallel or sequential mode for merging of the worker
+ * sketches. Can be <b>PARALLEL</b>, <b>SEQUENTIAL</b> or <b>AUTO</b>. See {@link ClusterStatisticsMergeMode} for
+ * more information on each mode. Default value is <b>PARALLEL</b>.</li>
+ * </ol>
+ */
public class MultiStageQueryContext
{
public static final String CTX_MSQ_MODE = "mode";
@@ -227,7 +255,9 @@ public class MultiStageQueryContext
try {
// Not caching this ObjectMapper in a static, because we expect to use
it infrequently (once per INSERT
// query that uses this feature) and there is no need to keep it
around longer than that.
- return new ObjectMapper().readValue(sortOrderString, new
TypeReference<List<String>>() {});
+ return new ObjectMapper().readValue(sortOrderString, new
TypeReference<List<String>>()
+ {
+ });
}
catch (JsonProcessingException e) {
throw QueryContexts.badValueException(CTX_SORT_ORDER, "CSV or JSON
array", sortOrderString);
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]