kfaraz commented on code in PR #12569: URL: https://github.com/apache/druid/pull/12569#discussion_r883255406
##########
docs/ingestion/automatic-compaction.md:
##########
@@ -0,0 +1,196 @@
+---
+id: automatic-compaction
+title: "Automatic compaction"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements. See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership. The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License. You may obtain a copy of the License at
+  ~
+  ~ http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied. See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+In Apache Druid, compaction is a special type of ingestion task that reads data from a Druid datasource and writes it back into the same datasource. A common use case for this is to [optimally size segments](../operations/segment-optimization.md) after the data is ingested in Druid to improve query performance. Automatic compaction, or auto-compaction, refers to the system for automatic execution of compaction tasks managed by the [Druid Coordinator](../design/coordinator.md).
+
+The frequency of compaction tasks relies on the Coordinator [indexing period](../configuration/index.md#coordinator-operation), configured by `druid.coordinator.period.indexingPeriod`.
+The default indexing period is 30 minutes, meaning that the Coordinator first checks for segments to compact at most 30 minutes from when auto-compaction is enabled.
+Note this time period affects other Coordinator duties including merge and conversion tasks.
+To configure the frequency of compaction tasks, [create a new duty group for the Coordinator](#set-frequency-of-compaction-runs).
+
+At every indexing period, the Coordinator initiates a [segment search](../design/coordinator.md#segment-search-policy-in-automatic-compaction) to determine eligible segments to compact.
+When there are eligible segments to compact, the Coordinator issues compaction tasks based on available worker capacity.
+If a compaction task takes longer than the indexing period, the Coordinator waits for it to finish before resuming the period for segment search.
+
+As a best practice, you should set up auto-compaction for all Druid datasources. You can run compaction tasks manually for cases where you want to allocate more system resources. For example, you may choose to run multiple compaction tasks in parallel to compact an existing datasource for the first time. See [Compaction](compaction.md) for additional details and use cases.
+
+This topic guides you through setting up automatic compaction for your Druid cluster. See the [examples](#examples) for common use cases for automatic compaction.
+
+## Enable automatic compaction
+
+You can enable automatic compaction for a datasource using the Druid console or programmatically via an API.
+This process differs for manual compaction tasks, which can be submitted from the [Tasks view of the Druid console](../operations/druid-console.md) or the [Tasks API](../operations/api-reference.md#post-5).
+
+### Druid console
+
+Use the Druid console to enable automatic compaction for a datasource as follows.
+
+1. Click **Datasources** in the top-level navigation.
+2. In the **Compaction** column, click the edit icon for the datasource to compact.
+3. In the **Compaction config** dialog, configure the auto-compaction settings. The dialog offers a form view as well as a JSON view. Editing the form updates the JSON specification, and editing the JSON updates the form field, if present.
Form fields not present in the JSON indicate default values. You may add additional properties to the JSON for auto-compaction settings not displayed in the form. See [Configure automatic compaction](#configure-automatic-compaction) for supported settings for auto-compaction.
+4. Click **Submit**.
+5. Refresh the **Datasources** view. The **Compaction** column for the datasource changes from “Not enabled” to “Awaiting first run.”
+
+The following screenshot shows the compaction config dialog for a datasource with auto-compaction enabled.
+
+
+To disable auto-compaction for a datasource, click **Delete** from the **Compaction config** dialog. Druid does not retain your auto-compaction configuration.
+
+### Compaction configuration API
+
+Use the [Coordinator API](../operations/api-reference.md#automatic-compaction-status) to configure automatic compaction.
+To enable auto-compaction for a datasource, create a JSON object with the desired auto-compaction settings.
+See [Configure automatic compaction](#configure-automatic-compaction) for the syntax of an auto-compaction spec.
+Send the JSON object as a payload in a [`POST` request](../operations/api-reference.md#post-4) to `/druid/coordinator/v1/config/compaction`.
+The following example configures auto-compaction for the `wikipedia` datasource:
+
+```sh
+curl --location --request POST 'http://localhost:8081/druid/coordinator/v1/config/compaction' \
+--header 'Content-Type: application/json' \
+--data-raw '{
+    "dataSource": "wikipedia",
+    "granularitySpec": {
+        "segmentGranularity": "DAY"
+    }
+}'
+```
+
+To disable auto-compaction for a datasource, send a [`DELETE` request](../operations/api-reference.md#delete-1) to `/druid/coordinator/v1/config/compaction/{dataSource}`. Replace `{dataSource}` with the name of the datasource for which to disable auto-compaction.
For example:
+
+```sh
+curl --location --request DELETE 'http://localhost:8081/druid/coordinator/v1/config/compaction/wikipedia'
+```
+
+
+## Configure automatic compaction
+
+You can configure automatic compaction dynamically without restarting Druid.
+The automatic compaction system uses the following syntax:
+
+```json
+{
+    "dataSource": <task_datasource>,
+    "ioConfig": <IO config>,
+    "dimensionsSpec": <custom dimensionsSpec>,
+    "transformSpec": <custom transformSpec>,
+    "metricsSpec": <custom metricsSpec>,
+    "tuningConfig": <parallel indexing task tuningConfig>,
+    "granularitySpec": <compaction task granularitySpec>,
+    "skipOffsetFromLatest": <time period to avoid compaction>,
+    "taskPriority": <compaction task priority>,
+    "taskContext": <task context>
+}
+```
+
+Most fields in the auto-compaction configuration align with a typical [Druid ingestion spec](../ingestion/ingestion-spec.md).
+The following properties only apply to auto-compaction:
+* `skipOffsetFromLatest`
+* `taskPriority`
+* `taskContext`
+
+Since the automatic compaction system provides a management layer on top of manual compaction tasks,
+the auto-compaction configuration does not include task-specific properties found in a typical Druid ingestion spec.
+The following properties are automatically set by the Coordinator:
+* `type`: Set to `compact`.
+* `id`: Generated using the task type, datasource name, interval, and timestamp.
+* `context`: Set according to the user-provided `taskContext`.
+
+For more details on each of the specs in an auto-compaction configuration, see [Automatic compaction dynamic configuration](../configuration/index.md#automatic-compaction-dynamic-configuration).
+
+
+### Avoid conflicts with ingestion
+
+The Coordinator compacts segments from newest to oldest. In the auto-compaction configuration, you can set a time period, relative to the end time of the most recent segment, for segments that should not be compacted. Assign this value to `skipOffsetFromLatest`.
Note that this offset is not relative to the current time but to the latest segment time. For example, if you want to skip over segments from thirty days prior to the end time of the most recent segment, assign `"skipOffsetFromLatest": "P30D"`.
+
+Compaction tasks that interfere with ingestion tasks will fail. For example, this occurs when an ingestion task needs to write data to a segment for a time interval locked for compaction. To facilitate the continuance of compaction tasks, consider one of the following strategies:
+* Set `skipOffsetFromLatest` to reduce the chance of conflicts between ingestion and compaction.
+* Increase the priority value of compaction tasks relative to ingestion tasks. In the auto-compaction configuration, set `taskPriority` to the desired priority value. See [Lock priority](../ingestion/tasks.md#lock-priority) for the priority values of different task types.

Review Comment:
   Yeah, I think we shouldn't advise increasing priority at all. Or at least put it under a big warning sign saying "for advanced users only" or something.
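
To illustrate the alternative, here is a minimal sketch of an auto-compaction config that relies only on `skipOffsetFromLatest` to stay clear of ingestion and does not set `taskPriority`. The values (`wikipedia`, `DAY`, `P30D`) are simply the ones already used elsewhere in this doc, chosen for illustration rather than as a recommendation for any particular cluster.

```json
{
    "dataSource": "wikipedia",
    "granularitySpec": {
        "segmentGranularity": "DAY"
    },
    "skipOffsetFromLatest": "P30D"
}
```

A payload of this shape would be sent to `/druid/coordinator/v1/config/compaction` in the same way as the `POST` example in the doc.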
