Re: [PR] docs: msq autocompaction (druid)

via GitHub Thu, 22 Aug 2024 23:02:16 -0700


gargvishesh commented on code in PR #16681:
URL: https://github.com/apache/druid/pull/16681#discussion_r1716366537



##########
docs/multi-stage-query/known-issues.md:
##########
@@ -68,3 +68,17 @@ properties, and the `indexSpec` 
[`tuningConfig`](../ingestion/ingestion-spec.md#
 - The maximum number of elements in a window cannot exceed a value of 100,000. 
 - To avoid `leafOperators` in MSQ engine, window functions have an extra scan 
stage after the window stage for cases 
 where native engine has a non-empty `leafOperator`.
+
+## Automatic compaction
+
+<!--This list also exists in data-management/automatic-compaction-->
+
+The following known issues and limitations affect automatic compaction with 
the MSQ task engine:
+
+- Only range-based partitioning is supported
+- You cannot group or roll up metrics for dimensions 
+- You cannot group on multi-value dimensions
+- The `maxTotalRows` config is not supported. Use `maxRowsPerSegment` instead.
+- `queryGranularity` cannot be set to `all`

Review Comment:
   The `queryGranularity` limitation can be removed.



##########
docs/data-management/automatic-compaction.md:
##########
@@ -131,6 +131,52 @@ maximize performance and minimize disk usage of the 
`compact` tasks launched by
 
 For more details on each of the specs in an auto-compaction configuration, see 
[Automatic compaction dynamic 
configuration](../configuration/index.md#automatic-compaction-dynamic-configuration).
 
+### Compaction engine
+
+When you configure automatic compaction, you can specify whether Druid uses 
the native engine or the multi-stage query (MSQ) task engine to perform the 
compaction.  The native engine was the only engine available for compaction 
prior to the introduction of the MSQ task engine and the corresponding `engine` 
context parameter. 
+
+Using the MSQ task engine for compaction provides faster compaction times as 
well as better memory tuning and usage. For more information about the MSQ task 
engine, see [MSQ task engine concepts](../multi-stage-query/concepts.md).
+
+To use the native compaction engine, either omit the `engine` config when 
submitting your compaction task spec or set it to `native`.
+
+To use the MSQ task engine for automatic compaction, do the following:
+
+* Have the [MSQ  task engine extension 
loaded](../multi-stage-query/index.md#load-the-extension).
+* In the compaction task spec for a datasource, set `compactionConfigs.engine` 
to `msq`. The default is `native`.
+* Have at least two compaction task slots available or set 
`compactionConfig.taskContext.maxNumTasks` to two or more. The MSQ task engine 
requires at least two tasks to run, one controller task and one worker task.
+
+Keep the following limitations in mind MSQ task engine for auto-compaction:
+
+<!--Duplicated in multi-stage-query/known-issues.md-->
+
+- Only range-based partitioning is supported
+- You cannot group or roll up metrics for dimensions 
+- You cannot group on multi-value dimensions
+- The `maxTotalRows` config is not supported. Use `maxRowsPerSegment` instead.
+- `queryGranularity` cannot be set to `all`

Review Comment:
   Some things have changed now. We can use these (or something similar, 
especially for the first point below) instead:
   
   * `metricsSpec` in the compaction config is supported only if it has idempotent 
aggregators, i.e. aggregators that can be applied repeatedly to the same column 
and still produce correct results. For example, 
   `{"name": "added", "type": "longSum", "fieldName": "added"}` is idempotent,
   but the following aren't:
   `{"name": "sum_added", "type": "longSum", "fieldName": "added" }` (rolls up 
the `added` column to a different `sum_added` column), 
   `{"name": added, "type":"", fieldName: added}` (partial sketches can be 
merged only with HLLSketchMergeAggregatorFactory), and
   `{"name": "count", "type": "count"}` (rolls up to a different `count` column).
   * Only dynamic and range-based partitioning are supported.
   * `rollup` should be set to `true` if and only if `metricsSpec` is specified.
   * The `maxTotalRows` config is not supported in `DynamicPartitionsSpec`. Use 
`maxRowsPerSegment` instead.
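
To make the proposed limitations concrete, here is a rough sketch of how they could be checked against a compaction config expressed as a Python dict. It is not Druid's actual validation logic. The helper names, the `name == fieldName` heuristic for idempotency, and the set of aggregator types treated as self-combining are all assumptions for illustration; only `longSum` is confirmed by the examples above.

```python
# Assumption: only longSum is confirmed above; the other types are illustrative.
ASSUMED_SELF_COMBINING_TYPES = {"longSum", "doubleSum", "floatSum"}
SUPPORTED_PARTITION_TYPES = {"dynamic", "range"}


def is_idempotent(aggregator):
    """Heuristic: the aggregator writes back to the same column it reads."""
    field = aggregator.get("fieldName")
    return (
        field is not None
        and aggregator.get("name") == field
        and aggregator.get("type") in ASSUMED_SELF_COMBINING_TYPES
    )


def check_msq_compaction_config(config):
    """Return a list of violations of the limitations above (empty if none)."""
    problems = []

    # Limitation 1: every aggregator in metricsSpec must be idempotent.
    metrics = config.get("metricsSpec") or []
    for agg in metrics:
        if not is_idempotent(agg):
            problems.append(f"aggregator {agg.get('name')!r} is not idempotent")

    # Limitations 2 and 4: partitioning type and maxTotalRows.
    partitions = (config.get("tuningConfig") or {}).get("partitionsSpec") or {}
    ptype = partitions.get("type")
    if ptype is not None and ptype not in SUPPORTED_PARTITION_TYPES:
        problems.append(f"partitioning type {ptype!r} is not supported")
    if ptype == "dynamic" and partitions.get("maxTotalRows") is not None:
        problems.append("maxTotalRows is not supported; use maxRowsPerSegment")

    # Limitation 3: rollup is true if and only if metricsSpec is specified.
    rollup = bool((config.get("granularitySpec") or {}).get("rollup", False))
    if rollup != bool(metrics):
        problems.append("rollup must be true if and only if metricsSpec is set")

    return problems


# Hypothetical configs used only to exercise the checks.
GOOD_CONFIG = {
    "metricsSpec": [{"name": "added", "type": "longSum", "fieldName": "added"}],
    "granularitySpec": {"rollup": True},
    "tuningConfig": {
        "partitionsSpec": {"type": "dynamic", "maxRowsPerSegment": 5000000}
    },
}

BAD_CONFIG = {
    "metricsSpec": [
        {"name": "sum_added", "type": "longSum", "fieldName": "added"},
        {"name": "count", "type": "count"},
    ],
    "granularitySpec": {"rollup": False},
    "tuningConfig": {"partitionsSpec": {"type": "hashed"}},
}

if __name__ == "__main__":
    print(check_msq_compaction_config(GOOD_CONFIG))
    for problem in check_msq_compaction_config(BAD_CONFIG):
        print(problem)
```

The first config passes every check. The second trips four of them: two non-idempotent aggregators, an unsupported partitioning type, and a `rollup` value that does not match the presence of `metricsSpec`.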
   



##########
docs/data-management/automatic-compaction.md:
##########
@@ -131,6 +131,52 @@ maximize performance and minimize disk usage of the 
`compact` tasks launched by
 
 For more details on each of the specs in an auto-compaction configuration, see 
[Automatic compaction dynamic 
configuration](../configuration/index.md#automatic-compaction-dynamic-configuration).
 
+### Compaction engine
+
+When you configure automatic compaction, you can specify whether Druid uses 
the native engine or the multi-stage query (MSQ) task engine to perform the 
compaction.  The native engine was the only engine available for compaction 
prior to the introduction of the MSQ task engine and the corresponding `engine` 
context parameter. 
+
+Using the MSQ task engine for compaction provides faster compaction times as 
well as better memory tuning and usage. For more information about the MSQ task 
engine, see [MSQ task engine concepts](../multi-stage-query/concepts.md).
+
+To use the native compaction engine, either omit the `engine` config when 
submitting your compaction task spec or set it to `native`.
+
+To use the MSQ task engine for automatic compaction, do the following:
+
+* Have the [MSQ  task engine extension 
loaded](../multi-stage-query/index.md#load-the-extension).
+* In the compaction task spec for a datasource, set `compactionConfigs.engine` 
to `msq`. The default is `native`.
+* Have at least two compaction task slots available or set 
`compactionConfig.taskContext.maxNumTasks` to two or more. The MSQ task engine 
requires at least two tasks to run, one controller task and one worker task.
+
+Keep the following limitations in mind MSQ task engine for auto-compaction:
+
+<!--Duplicated in multi-stage-query/known-issues.md-->
+
+- Only range-based partitioning is supported
+- You cannot group or roll up metrics for dimensions 
+- You cannot group on multi-value dimensions
+- The `maxTotalRows` config is not supported. Use `maxRowsPerSegment` instead.
+- `queryGranularity` cannot be set to `all`
+
+#### MSQ task engine context parameters
+
+You can use [MSQ task engine context parameters](../multi-stage-query/) in 
`compactionConfig.taskContext` when configuring your datasource for automatic 
compaction, such as setting the maximum number of tasks using the 
`compactionConfig.taskContext.maxNumTasks` parameter. Some of the MSQ task 
engine context parameters overlap with automatic compaction parameters. When 
these settings overlap, set one or the other.
+
+The following table has the MSQ task engine context parameter first with the 
native context parameter in parenthesis:
+
+| MSQ task engine context parameter | Automatic compaction config |
+|--------------------------------------------|---------------------------------------------|
+| `context.priority`                         | `taskPriority`                              |
+| `context.rowsPerSegment`                   | `tuningConfig.targetRowsPerSegment`         |
+| `context.priority`                         | `taskContext.priority`                      |
+| `context.storeCompactionState`             | `taskContext.storeCompactionState`          |
+| `sqlQueryContext.sqlInsertSegmentGranularity` | `granularitySpec.segmentGranularity`      |
+| `spec.query.dataSource` or `dataSource`    | `dataSource`                                |
+| `spec.tuningConfig.indexSpec`              | `tuningConfig.indexSpec`                    |
+| `spec.query.orederBy`                      | `tuningConfig.indexSpec`                    |
+| `spec.query.granularity`                   | `granularitySpec.queryGranularity`          |
+| `spec.query.dimensions`                    | `dimensionsSpec`                            |
+| `spec.query.filter`                        | `transformSpec.filter`                      |
+| `spec.query.aggregations`                  | `metricsSpec`                               |
+
+

Review Comment:
   This is an internal detail of how the full compaction config translates to 
an MSQ task spec and is immaterial to the user, so it can be skipped.



##########
docs/data-management/automatic-compaction.md:
##########
@@ -131,6 +131,52 @@ maximize performance and minimize disk usage of the 
`compact` tasks launched by
 
 For more details on each of the specs in an auto-compaction configuration, see 
[Automatic compaction dynamic 
configuration](../configuration/index.md#automatic-compaction-dynamic-configuration).
 
+### Compaction engine
+
+When you configure automatic compaction, you can specify whether Druid uses 
the native engine or the multi-stage query (MSQ) task engine to perform the 
compaction.  The native engine was the only engine available for compaction 
prior to the introduction of the MSQ task engine and the corresponding `engine` 
context parameter. 
+
+Using the MSQ task engine for compaction provides faster compaction times as 
well as better memory tuning and usage. For more information about the MSQ task 
engine, see [MSQ task engine concepts](../multi-stage-query/concepts.md).
+
+To use the native compaction engine, either omit the `engine` config when 
submitting your compaction task spec or set it to `native`.
+
+To use the MSQ task engine for automatic compaction, do the following:
+
+* Have the [MSQ  task engine extension 
loaded](../multi-stage-query/index.md#load-the-extension).
+* In the compaction task spec for a datasource, set `compactionConfigs.engine` 
to `msq`. The default is `native`.
+* Have at least two compaction task slots available or set 
`compactionConfig.taskContext.maxNumTasks` to two or more. The MSQ task engine 
requires at least two tasks to run, one controller task and one worker task.
+
+Keep the following limitations in mind MSQ task engine for auto-compaction:

Review Comment:
   ```suggestion
   Keep the following limitations in mind when using MSQ task engine for 
auto-compaction:
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

