[druid] branch master updated: docs(fix): add clarity around granularitySpec (#12362)

techdocsmith Wed, 06 Apr 2022 09:24:54 -0700

This is an automated email from the ASF dual-hosted git repository.

techdocsmith pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git



The following commit(s) were added to refs/heads/master by this push:
     new ac6c24793e docs(fix): add clarity around granularitySpec (#12362)
ac6c24793e is described below

commit ac6c24793e23672fa575d855d0cb5a3ba610f2bb
Author: 317brian <[email protected]>
AuthorDate: Wed Apr 6 09:24:37 2022 -0700

    docs(fix): add clarity around granularitySpec (#12362)
    
    * fix: add clarity around granularitySpec
    
    * fix spacing
    
    * Update docs/ingestion/compaction.md
    
    Co-authored-by: Victoria Lim <[email protected]>
    
    Co-authored-by: Victoria Lim <[email protected]>
---
 docs/ingestion/compaction.md | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/docs/ingestion/compaction.md b/docs/ingestion/compaction.md
index af38c4e153..379e1d3497 100644
--- a/docs/ingestion/compaction.md
+++ b/docs/ingestion/compaction.md
@@ -25,13 +25,16 @@ description: "Defines compaction and automatic compaction 
(auto-compaction or au
 Query performance in Apache Druid depends on optimally sized segments. 
Compaction is one strategy you can use to optimize segment size for your Druid 
database. Compaction tasks read an existing set of segments for a given time 
interval and combine the data into a new "compacted" set of segments. In some 
cases the compacted segments are larger, but there are fewer of them. In other 
cases the compacted segments may be smaller. Compaction tends to increase 
performance because optimized segm [...]
 
 ## Compaction strategies
+
 There are several cases to consider compaction for segment optimization:
+
 - With streaming ingestion, data can arrive out of chronological order 
creating lots of small segments.
 - When you append data using `appendToExisting` for [native 
batch](native-batch.md) ingestion and the appends create suboptimal segments.
 - When you use `index_parallel` for parallel batch indexing and the parallel 
ingestion tasks create many small segments.
 - When a misconfigured ingestion task creates oversized segments.
 
 By default, compaction does not modify the underlying data of the segments. 
However, there are cases when you may want to modify data during compaction to 
improve query performance:
+
 - If, after ingestion, you realize that data for the time interval is sparse, 
you can use compaction to increase the segment granularity.
 - Over time you don't need fine-grained granularity for older data, so you want 
to use compaction to change older segments to a coarser query granularity. This 
reduces the storage space required for older data. For example, from `minute` to 
`hour`, or `hour` to `day`. 
 - You can change the dimension order to improve sorting and reduce segment 
size.
@@ -46,34 +49,39 @@ You can configure the Druid Coordinator to perform 
automatic compaction, also ca
 Automatic compaction works in most use cases and should be your first option. 
To learn more about automatic compaction, see [Compacting 
Segments](../design/coordinator.md#compacting-segments).
 
 In cases where you require more control over compaction, you can manually 
submit compaction tasks. For example:
+
 - Automatic compaction is running into the limit of task slots available to 
it, so tasks are waiting for previous automatic compaction tasks to complete. 
Manual compaction can use all available task slots, therefore you can complete 
compaction more quickly by submitting more concurrent tasks for more intervals.
 - You want to force compaction for a specific time range or you want to 
compact data out of chronological order.
 
 See [Setting up a manual compaction task](#setting-up-manual-compaction) for 
more about manual compaction tasks.
 
 ## Data handling with compaction
+
 During compaction, Druid overwrites the original set of segments with the 
compacted set. Druid also locks the segments for the time interval being 
compacted to ensure data consistency. By default, compaction tasks do not 
modify the underlying data. You can configure the compaction task to change the 
query granularity or add or remove dimensions in the compaction task. This 
means that the only changes to query results should be the result of 
intentional, not automatic, changes.
 
 You can set `dropExisting` in `ioConfig` to "true" in the compaction task to 
configure Druid to replace all existing segments fully contained by the 
interval. See the suggestion for reindexing with finer granularity under 
[Implementation considerations](native-batch.md#implementation-considerations) 
for an example.
-> WARNING: `dropExisting` in `ioConfig` is a beta feature. 
+> WARNING: `dropExisting` in `ioConfig` is a beta feature.
 
 If an ingestion task needs to write data to a segment for a time interval 
locked for compaction, by default the ingestion task supersedes the compaction 
task and the compaction task fails without finishing. For manual compaction 
tasks you can adjust the input spec interval to avoid conflicts between 
ingestion and compaction. For automatic compaction, you can set the 
`skipOffsetFromLatest` key to adjust the auto compaction starting point from 
the current time to reduce the chance of confl [...]
 
 ### Segment granularity handling
 
-Unless you modify the segment granularity in the [granularity 
spec](#compaction-granularity-spec), Druid attempts to retain the granularity 
for the compacted segments. When segments have different segment granularities 
with no overlap in interval Druid creates a separate compaction task for each 
to retain the segment granularity in the compacted segment.
+Unless you modify the segment granularity in 
[`granularitySpec`](#compaction-granularity-spec), Druid attempts to retain the 
granularity for the compacted segments. When segments have different segment 
granularities with no overlap in interval Druid creates a separate compaction 
task for each to retain the segment granularity in the compacted segment.
+
+If segments have different segment granularities before compaction but there 
is some overlap in interval, Druid attempts to find the start and end of the 
overlapping interval and uses the closest segment granularity level for the 
compacted segment.
 
-If segments have different segment granularities before compaction but there 
is some overlap in interval, Druid attempts find start and end of the 
overlapping interval and uses the closest segment granularity level for the 
compacted segment. For example consider two overlapping segments: segment "A" 
for the interval 01/01/2021-01/02/2021 with day granularity and segment "B" for 
the interval 01/01/2021-02/01/2021. Druid attempts to combine and compacted the 
overlapped segments. In this ex [...]
+For example, consider two overlapping segments: segment "A" for the interval 
01/01/2021-01/02/2021 with day granularity and segment "B" for the interval 
01/01/2021-02/01/2021. Druid attempts to combine and compact the overlapped 
segments. In this example, the earliest start time for the two segments is 
01/01/2021 and the latest end time of the two segments is 02/01/2021. Druid 
compacts the segments together even though they have different segment 
granularity. Druid uses month segment gran [...]
 
 ### Query granularity handling
 
-Unless you modify the query granularity in the [granularity 
spec](#compaction-granularity-spec), Druid retains the query granularity for 
the compacted segments. If segments have different query granularities before 
compaction, Druid chooses the finest level of granularity for the resulting 
compacted segment. For example if a compaction task combines two segments, one 
with day query granularity and one with minute query granularity, the resulting 
segment uses minute query granularity.
+Unless you modify the query granularity in the 
[`granularitySpec`](#compaction-granularity-spec), Druid retains the query 
granularity for the compacted segments. If segments have different query 
granularities before compaction, Druid chooses the finest level of granularity 
for the resulting compacted segment. For example if a compaction task combines 
two segments, one with day query granularity and one with minute query 
granularity, the resulting segment uses minute query granularity.
 
 > In Apache Druid 0.21.0 and prior, Druid sets the granularity for compacted 
 > segments to the default granularity of `NONE` regardless of the query 
 > granularity of the original segments.
 
 If you configure query granularity in compaction to go from a finer 
granularity like month to a coarser query granularity like year, then Druid 
overshadows the original segment with coarser granularity. Because the new 
segments have a coarser granularity, running a kill task to remove the 
overshadowed segments for those intervals will cause you to permanently lose 
the finer granularity data.
 
 ### Dimension handling
+
 Apache Druid supports schema changes. Therefore, dimensions can be different 
across segments even if they are a part of the same data source. See [Different 
schemas among 
segments](../design/segments.md#different-schemas-among-segments). If the input 
segments have different dimensions, the resulting compacted segment includes all 
dimensions of the input segments. 
 
 Even when the input segments have the same set of dimensions, the dimension 
order or the data type of dimensions can be different. The dimensions of recent 
segments precede those of old segments in terms of data types and the ordering 
because more recent segments are more likely to have the preferred order and 
data types.
@@ -115,17 +123,18 @@ To perform a manual compaction, you submit a compaction 
task. Compaction tasks m
 |`segmentGranularity`|When set, the compaction task changes the segment 
granularity for the given interval.  Deprecated. Use `granularitySpec`. |No.|
 |`tuningConfig`|[Parallel indexing task 
tuningConfig](native-batch.md#tuningconfig). 
`awaitSegmentAvailabilityTimeoutMillis` in the tuning config is not currently 
supported for compaction tasks. Do not set it to a non-zero value.|No|
 |`context`|[Task context](./tasks.md#context)|No|
-|`granularitySpec`|Custom `granularitySpec`. The compaction task uses the 
specified `granularitySpec` rather than generating one. See [Compaction 
granularitySpec](#compaction-granularity-spec) for details.|No|
+|`granularitySpec`|Custom `granularitySpec`. The compaction task uses the 
specified `granularitySpec` rather than generating one. See [Compaction 
`granularitySpec`](#compaction-granularity-spec) for details.|No|
 
 > Note: Use `granularitySpec` over `segmentGranularity` and only set one of 
 > these values. If you specify different values for these in the same 
 > compaction spec, the task fails.
 
-To control the number of result segments per time chunk, you can set 
[maxRowsPerSegment](../configuration/index.md#compaction-dynamic-configuration) 
or [numShards](../ingestion/native-batch.md#tuningconfig).
+To control the number of result segments per time chunk, you can set 
[`maxRowsPerSegment`](../configuration/index.md#compaction-dynamic-configuration)
 or [`numShards`](../ingestion/native-batch.md#tuningconfig).
 
 > You can run multiple compaction tasks in parallel. For example, if you want 
 > to compact the data for a year, you are not limited to running a single task 
 > for the entire year. You can run 12 compaction tasks with month-long 
 > intervals.
 
 A compaction task internally generates an `index` task spec for performing 
compaction work with some fixed parameters. For example, its `inputSource` is 
always the [DruidInputSource](./native-batch-input-source.md), and 
`dimensionsSpec` and `metricsSpec` include all dimensions and metrics of the 
input segments by default.
 
 Compaction tasks exit without doing anything and issue a failure status code 
in either of the following cases:
+
 - If the interval you specify has no data segments loaded<br>
 - If the interval you specify is empty.
 
@@ -133,6 +142,7 @@ Note that the metadata between input segments and the 
resulting compacted segmen
 
 
 ### Example compaction task
+
 The following JSON illustrates a compaction task to compact _all segments_ 
within the interval `2020-01-01/2021-01-01` and create new segments:
 
 ```json
@@ -153,6 +163,9 @@ The following JSON illustrates a compaction task to compact 
_all segments_ withi
 }
 ```
 
+`granularitySpec` is an optional field.
+If you don't specify `granularitySpec`, Druid retains the original segment and 
query granularities when compaction is complete.
+
 ### Compaction I/O configuration
 
 The compaction `ioConfig` requires specifying `inputSpec` as follows:
@@ -181,17 +194,20 @@ Druid supports two supported `inputSpec` formats:
 |`segments`|A list of segment IDs|Yes|
 
 ### Compaction dimensions spec
+
 |Field|Description|Required|
 |-----|-----------|--------|
 |`dimensions`| A list of dimension names or objects. Cannot have the same 
column in both `dimensions` and `dimensionExclusions`. Defaults to `null`, 
which preserves the original dimensions.|No|
 |`dimensionExclusions`| The names of dimensions to exclude from compaction. 
Only names are supported here, not objects. This list is only used if the 
dimensions list is null or empty; otherwise it is ignored. Defaults to `[]`.|No|
 
 ### Compaction transform spec
+
 |Field|Description|Required|
 |-----|-----------|--------|
 |`filter`| The `filter` conditionally filters input rows during compaction. 
Only rows that pass the filter will be included in the compacted segments. Any 
of Druid's standard [query filters](../querying/filters.md) can be used. 
Defaults to 'null', which will not filter any row. |No|
 
 ### Compaction granularity spec
+
 |Field|Description|Required|
 |-----|-----------|--------|
 |`segmentGranularity`|Time chunking period for the segment granularity. 
Defaults to 'null', which preserves the original segment granularity. Accepts 
all [Query granularity](../querying/granularities.md) values.|No|
@@ -199,6 +215,7 @@ Druid supports two supported `inputSpec` formats:
 |`rollup`|Whether to enable ingestion-time rollup or not. Defaults to 'null', 
which preserves the original setting. Note that once data is rolled up, individual 
records can no longer be recovered. |No|
 
 For example, to set the segment granularity to "day", the query granularity to 
"hour", and enable rollup:
+
 ```json
 {
   "type": "compact",
@@ -219,6 +236,7 @@ For example, to set the segment granularity to "day", the 
query granularity to "
 ```
 
 ## Learn more
+
 See the following topics for more information:
 - [Segment optimization](../operations/segment-optimization.md) for guidance 
to determine if compaction will help in your case.
 - [Compacting Segments](../design/coordinator.md#compacting-segments) for more 
on automatic compaction.
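
As a rough illustration of how the fields touched by this change fit together, the sketch below shows one possible manual compaction task that combines `ioConfig`, `dimensionsSpec`, `transformSpec`, and `granularitySpec`. It is not taken from the commit. The data source name `wikipedia`, the dimension names `comment` and `channel`, and the filter value are hypothetical placeholders (JSON has no comment syntax, so they are flagged here rather than inline). `dropExisting` is shown only because the documentation mentions it, and the documentation marks it as a beta feature.

```json
{
  "type": "compact",
  "dataSource": "wikipedia",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2020-01-01/2021-01-01"
    },
    "dropExisting": true
  },
  "dimensionsSpec": {
    "dimensionExclusions": ["comment"]
  },
  "transformSpec": {
    "filter": {
      "type": "selector",
      "dimension": "channel",
      "value": "#en.wikipedia"
    }
  },
  "granularitySpec": {
    "segmentGranularity": "day",
    "queryGranularity": "hour",
    "rollup": true
  }
}
```

Leaving out `granularitySpec` entirely should, per the text added in this commit, retain the original segment and query granularities. Set it only when a granularity change is intended.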


