vtlim commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r682074888



##########
File path: docs/ingestion/index.md
##########
@@ -22,29 +22,20 @@ title: "Ingestion"
   ~ under the License.
   -->
 
-All data in Druid is organized into _segments_, which are data files each of 
which may have up to a few million rows.
-Loading data in Druid is called _ingestion_ or _indexing_, and consists of 
reading data from a source system and creating
-segments based on that data.
+Loading data in Druid is called _ingestion_ or _indexing_. When you ingest 
data into Druid, Druid reads the data from your source system and stores it in 
data files called _segments_. In general, segment files contain a few million 
rows.
 
-In most ingestion methods, the Druid 
[MiddleManager](../design/middlemanager.md) processes
-(or the [Indexer](../design/indexer.md) processes) load your source data. One 
exception is
-Hadoop-based ingestion, where this work is instead done using a Hadoop 
MapReduce job on YARN (although MiddleManager or Indexer
-processes are still involved in starting and monitoring the Hadoop jobs). 
+For most ingestion methods, the Druid 
[MiddleManager](../design/middlemanager.md) processes or the 
[Indexer](../design/indexer.md) processes load your source data. One exception 
is
+Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN 
MiddleManager or Indexer processes start and monitor Hadoop jobs. 
 
-Once segments have been generated and stored in [deep 
storage](../dependencies/deep-storage.md), they are loaded by Historical 
processes. 
-For more details on how this works, see the [Storage 
design](../design/architecture.md#storage-design) section 
-of Druid's design documentation.
+After Druid creates segments have been generated and stores them in [deep 
storage](../dependencies/deep-storage.md), Historical processes load them to 
respond to queries. See the [Storage 
design](../design/architecture.md#storage-design) section of the Druid design 
documentation for more information.

Review comment:
       ```suggestion
   After Druid creates segments and stores them in [deep 
storage](../dependencies/deep-storage.md), Historical processes load them to 
respond to queries. See the [Storage 
design](../design/architecture.md#storage-design) section of the Druid design 
documentation for more information.
   ```

##########
File path: docs/ingestion/index.md
##########
@@ -22,29 +22,20 @@ title: "Ingestion"
   ~ under the License.
   -->
 
-All data in Druid is organized into _segments_, which are data files each of 
which may have up to a few million rows.
-Loading data in Druid is called _ingestion_ or _indexing_, and consists of 
reading data from a source system and creating
-segments based on that data.
+Loading data in Druid is called _ingestion_ or _indexing_. When you ingest 
data into Druid, Druid reads the data from your source system and stores it in 
data files called _segments_. In general, segment files contain a few million 
rows.
 
-In most ingestion methods, the Druid 
[MiddleManager](../design/middlemanager.md) processes
-(or the [Indexer](../design/indexer.md) processes) load your source data. One 
exception is
-Hadoop-based ingestion, where this work is instead done using a Hadoop 
MapReduce job on YARN (although MiddleManager or Indexer
-processes are still involved in starting and monitoring the Hadoop jobs). 
+For most ingestion methods, the Druid 
[MiddleManager](../design/middlemanager.md) processes or the 
[Indexer](../design/indexer.md) processes load your source data. One exception 
is
+Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN 
MiddleManager or Indexer processes start and monitor Hadoop jobs. 

Review comment:
      ```suggestion
  Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN. MiddleManager or Indexer processes start and monitor the Hadoop jobs.
  ```

##########
File path: docs/ingestion/data-model.md
##########
@@ -0,0 +1,38 @@
+---
+id: data-model
+title: "Druid data model"
+sidebar_label: Data model
+description: Introduces concepts of datasources, primary timestamp, 
dimensions, and metrics.
+---
+
+Druid stores data in datasources, which are similar to tables in a traditional 
relational database management systems (RDBMS). Druid's data model shares  
similarities with both relational and timeseries data models.

Review comment:
      ```suggestion
  Druid stores data in datasources, which are similar to tables in a traditional relational database management system (RDBMS). Druid's data model shares similarities with both relational and timeseries data models.
  ```

##########
File path: docs/ingestion/data-model.md
##########
@@ -0,0 +1,38 @@
+---
+id: data-model
+title: "Druid data model"
+sidebar_label: Data model
+description: Introduces concepts of datasources, primary timestamp, 
dimensions, and metrics.
+---
+
+Druid stores data in datasources, which are similar to tables in a traditional 
relational database management systems (RDBMS). Druid's data model shares  
similarities with both relational and timeseries data models.
+
+## Primary timestamp
+
+Druid schemas must always include a primary timestamp. Druid uses the primary 
timestamp to [partition and sort](./partitioning.md) your data. Druid uses the 
primary timestamp to rapidly identify and retrieve data within the time range 
of queries. Druid also uses the primary timestamp column
+for time-based [data management operations](./data-management.md) such as 
dropping time chunks, overwriting time chunks, and time-based retention rules.
+
+Druid parses the primary timestamp based on the 
[`timestampSpec`](./ingestion-spec.md#timestampspec) configuration at ingestion 
time. You can control other important operations that are based on the primary 
timestamp
+[`granularitySpec`](./ingestion-spec.md#granularityspec). Regardless of the 
source input field for the primary timestamp, Druid always stores the timestamp 
in the `__time` column in your Druid datasource.

Review comment:
       So the user can use _either_ `timestampSpec` or `granularitySpec` as a 
primary timestamp but not both?
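
      For context: per the linked pages, the two specs are separate sibling objects in the ingestion spec's `dataSchema`, used together rather than as alternatives. `timestampSpec` tells Druid how to parse the primary timestamp from the source field, while `granularitySpec` controls timestamp-based operations such as truncation and segment granularity. A minimal illustrative sketch (field values are hypothetical):

      ```json
      {
        "dataSchema": {
          "dataSource": "example_datasource",
          "timestampSpec": {
            "column": "ts",
            "format": "iso"
          },
          "granularitySpec": {
            "segmentGranularity": "day",
            "queryGranularity": "minute",
            "rollup": true
          }
        }
      }
      ```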

##########
File path: docs/ingestion/rollup.md
##########
@@ -0,0 +1,61 @@
+---
+id: rollup
+title: "Data rollup"
+sidebar_label: Data rollup
+description: Introduces rollup as a concept. Provides suggestions to maximize 
the benefits of rollup. Differentiates between perfect and best-effort rollup.
+---
+Druid can roll up data at ingestion time to reduce the amount of raw data to  
store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up 
data can dramatically reduce the size of data to be stored and reduce row 
counts by potentially orders of magnitude. As a trade off for the efficiency of 
rollup, you lose the ability to query individual events.
+
+At ingestion time, you control rollup with the `rollup` setting in the 
[`granularitySpec`](./ingestion-spec.md#granularityspec). Rollup is enabled by 
default. This means Druid combines into a single row any rows that have 
identical [dimension](./data-model.md#dimensions) values and 
[timestamp](./data-model.md#primary-timestamp) values after 
[`queryGranularity`-based truncation](./ingestion-spec.md#granularityspec).
+
+When you disable rollup, Druid loads each row as-is without doing any form of 
pre-aggregation. This mode is similar to databases that do not support a rollup 
feature. Set `rollup` to `false` if you want Druid to store each record as-is, 
without any rollup summarization.
+
+## Maximizing rollup ratio
+
+To measure the rollup ratio of a datasource, compare the number of rows in 
Druid with the number of ingested events. The higher this result, the more 
benefit you are gaining from rollup. For example you can run the following 
[Druid SQL](../querying/sql.md) query after ingestion:
+
+```sql
+SELECT SUM("cnt") / COUNT(*) * 1.0 FROM datasource
+```
+
+In this query, `cnt` refers to a "count" type metric from your ingestion spec. 
See
+[Counting the number of ingested events](schema-design.md#counting) on the 
"Schema design" page for more details about how counting works when rollup is 
enabled.
+
+Tips for maximizing rollup:
+
+- Design your schema with fewer dimensions and lower cardinality dimensions to 
yield better rollup ratios.
+- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality 
dimensions, which decrease rollup ratios.
+- Adjust your `queryGranularity` at ingestion time to increase the chances 
that multiple rows in Druid having matching timestamps. For example, use five 
minute query granularity (`PT5M`) instead of one minute (`PT1M`).
+- You can optionally load the same data into more than one Druid datasource. 
For example:
+    - Create a "full" datasource that has rollup disabled, or enabled, but 
with a minimal rollup ratio

Review comment:
       ```suggestion
       - Create a "full" datasource that has rollup disabled, or enabled, but 
with a minimal rollup ratio.
   ```

##########
File path: docs/ingestion/rollup.md
##########
@@ -0,0 +1,61 @@
+---
+id: rollup
+title: "Data rollup"
+sidebar_label: Data rollup
+description: Introduces rollup as a concept. Provides suggestions to maximize 
the benefits of rollup. Differentiates between perfect and best-effort rollup.
+---
+Druid can roll up data at ingestion time to reduce the amount of raw data to  
store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up 
data can dramatically reduce the size of data to be stored and reduce row 
counts by potentially orders of magnitude. As a trade off for the efficiency of 
rollup, you lose the ability to query individual events.
+
+At ingestion time, you control rollup with the `rollup` setting in the 
[`granularitySpec`](./ingestion-spec.md#granularityspec). Rollup is enabled by 
default. This means Druid combines into a single row any rows that have 
identical [dimension](./data-model.md#dimensions) values and 
[timestamp](./data-model.md#primary-timestamp) values after 
[`queryGranularity`-based truncation](./ingestion-spec.md#granularityspec).
+
+When you disable rollup, Druid loads each row as-is without doing any form of 
pre-aggregation. This mode is similar to databases that do not support a rollup 
feature. Set `rollup` to `false` if you want Druid to store each record as-is, 
without any rollup summarization.
+
+## Maximizing rollup ratio
+
+To measure the rollup ratio of a datasource, compare the number of rows in 
Druid with the number of ingested events. The higher this result, the more 
benefit you are gaining from rollup. For example you can run the following 
[Druid SQL](../querying/sql.md) query after ingestion:
+
+```sql
+SELECT SUM("cnt") / COUNT(*) * 1.0 FROM datasource
+```
+
+In this query, `cnt` refers to a "count" type metric from your ingestion spec. 
See
+[Counting the number of ingested events](schema-design.md#counting) on the 
"Schema design" page for more details about how counting works when rollup is 
enabled.
+
+Tips for maximizing rollup:
+
+- Design your schema with fewer dimensions and lower cardinality dimensions to 
yield better rollup ratios.
+- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality 
dimensions, which decrease rollup ratios.
+- Adjust your `queryGranularity` at ingestion time to increase the chances 
that multiple rows in Druid having matching timestamps. For example, use five 
minute query granularity (`PT5M`) instead of one minute (`PT1M`).
+- You can optionally load the same data into more than one Druid datasource. 
For example:
+    - Create a "full" datasource that has rollup disabled, or enabled, but 
with a minimal rollup ratio
+    - Create a second "abbreviated" datasource with fewer dimensions and a 
higher rollup ratio.
+     When queries only involve dimensions in the "abbreviated" set, use the 
second datasource to reduce query times. Often, this method only requires a 
small increase in storage footprint because abbreviated datasources tend to be 
substantially smaller.
+- If you use a [best-effort rollup](#perfect-rollup-vs-best-effort-rollup) 
ingestion configuration that does not guarantee perfect rollup, try one of the 
following:
+    - Switch to a guaranteed perfect rollup option.
+    - [Reindex](data-management.md#reingesting-data) or 
[compact](compaction.md) your data in the background after initial ingestion.
+
+## Perfect rollup vs Best-effort rollup
+
+Depending on the ingestion method, Druid has the following rollup options:
+- Guaranteed _perfect rollup_: Druid perfectly aggregates input data at 
ingestion time.
+- _best-effort rollup_: Druid may not perfectly aggregate input data. 
Therefore, multiple segments might contain rows with the same timestamp and 
dimension values.

Review comment:
       ```suggestion
   ## Perfect rollup vs best-effort rollup
   
   Depending on the ingestion method, Druid has the following rollup options:
   - Guaranteed _perfect rollup_: Druid perfectly aggregates input data at 
ingestion time.
   - _Best-effort rollup_: Druid may not perfectly aggregate input data. 
Therefore, multiple segments might contain rows with the same timestamp and 
dimension values.
   ```
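
      As an aside for readers of this section, the rollup behavior described above (combining rows that share identical dimension values and a `queryGranularity`-truncated timestamp) can be sketched in a few lines of Python. This is an illustration of the concept only, not Druid code; the dimension name and timestamps are made up:

      ```python
      from collections import defaultdict

      def rollup(rows, query_granularity_ms):
          """Combine rows sharing a truncated timestamp and identical
          dimension values, summing a 'count' metric (illustrative only)."""
          aggregated = defaultdict(int)
          for row in rows:
              # Truncate the timestamp down to its query-granularity bucket.
              bucket = row["ts"] // query_granularity_ms * query_granularity_ms
              key = (bucket, row["page"])  # single dimension "page" here
              aggregated[key] += 1
          return [
              {"ts": ts, "page": page, "count": n}
              for (ts, page), n in sorted(aggregated.items())
          ]

      # Three events in the same minute on the same page roll up into one row.
      events = [
          {"ts": 60_000, "page": "home"},
          {"ts": 60_500, "page": "home"},
          {"ts": 61_000, "page": "home"},
          {"ts": 60_200, "page": "about"},
      ]
      print(rollup(events, 60_000))
      ```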

##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. 
Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can 
have substantial impact on footprint and performance.
+
+One way to partition is to your load data into separate datasources. This is a 
perfectly viable approach that works very well when the number of datasources 
does not lead to excessive per-datasource overheads. 
+
+This topic describes how to set up partitions within a single datasource. It 
does not cover using multiple datasources. See [Multitenancy 
considerations](../querying/multitenancy.md) for more details on splitting data 
into separate datasources and potential operational considerations.

Review comment:
       ```suggestion
   This topic describes how to set up partitions within a single datasource. It 
does not cover how to use multiple datasources. See [Multitenancy 
considerations](../querying/multitenancy.md) for more details on splitting data 
into separate datasources and potential operational considerations.
   ```

##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. 
Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can 
have substantial impact on footprint and performance.
+
+One way to partition is to your load data into separate datasources. This is a 
perfectly viable approach that works very well when the number of datasources 
does not lead to excessive per-datasource overheads. 
+
+This topic describes how to set up partitions within a single datasource. It 
does not cover using multiple datasources. See [Multitenancy 
considerations](../querying/multitenancy.md) for more details on splitting data 
into separate datasources and potential operational considerations.
+
+## Time chunk partitioning
+
+Druid always partitions datasources by time into _time chunks_. Each time 
chunk contains one or more segments. This partitioning happens for all 
ingestion methods based on the `segmentGranularity` parameter in your ingestion 
spec `dataSchema` object.
+
+## Secondary partitioning
+
+Druid can partition segments within a particular time chunk further depending 
upon options that vary based on the ingestion type you have chosen. In general, 
secondary partitioning on a particular dimension improves locality. This means 
that rows with the same value for that dimension are stored together, 
decreasing access time.
+
+To achieve the best performance and smallest overall footprint, partition your 
data on a "natural"
+dimension that you often use as a filter when possible. Such partitioning 
often improves compression and query performance. For example, some cases have 
yielded threefold storage size decreases.
+
+## Partitioning and sorting
+
+Partitioning and sorting work well together. If you do have a "natural" 
partitioning dimension, consider placing it first in the `dimensions` list of 
your `dimensionsSpec`. This way Druid sorts rows within each segment by that 
column. This sorting configuration frequently improves compression more than 
using partitioning alone.
+
+> Note that Druid always sorts rows within a segment by timestamp first, even 
before the first dimension listed in your `dimensionsSpec`. This sorting can 
preclude the efficacy of dimension sorting. To work around this limitation if 
necessary, set your `queryGranularity` equal to `segmentGranularity` in your 
[`granularitySpec`](./ingestion-spec.md#granularityspec). Druid will set all 
timestamps within the segment to the same value, and letting you identify a 
[secondary timestamp](schema-design.md#secondary-timestamps) as the "real" 
timestamp.
+
+## How to configure partitioning
+
+Not all ingestion methods support an explicit partitioning configuration, and 
not all have equivalent levels of flexibility. If you are doing initial 
ingestion through a less-flexible method like
+Kafka), you can use [reindexing](data-management.md#reingesting-data) or 
[compaction](compaction.md) to repartition your data after initial ingestion. 
This is a powerful technique you can use to optimally partition any data older 
than a certain even while you continuously add new data from a stream.
+
+The following table shows how each ingestion method handles partitioning:
+
+|Method|How it works|
+|------|------------|
+|[Native batch](native-batch.md)|Configured using 
[`partitionsSpec`](native-batch.md#partitionsspec) inside the `tuningConfig`.|
+|[Hadoop](hadoop.md)|Configured using 
[`partitionsSpec`](hadoop.md#partitionsspec) inside the `tuningConfig`.|
+|[Kafka indexing 
service](../development/extensions-core/kafka-ingestion.md)|Kafka topic 
partitioning defines how partitions the datasource. You can also 
[reindex](data-management.md#reingesting-data) or [compact](compaction.md) to 
repartition after initial ingestion.|
+|[Kinesis indexing 
service](../development/extensions-core/kinesis-ingestion.md)|Kinesis stream 
sharding defines how partitions the datasource.. You can also 
[reindex](data-management.md#reingesting-data) or [compact](compaction.md) to 
repartition after initial ingestion.|

Review comment:
       "defines how partitions the datasource" sounds unclear; perhaps "defines how Druid partitions the datasource"?
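
      For the native batch row in this table, the kind of configuration being referenced looks roughly like the sketch below (a hypothetical fragment; the `partitionDimensions` value is made up for illustration):

      ```json
      "tuningConfig": {
        "type": "index_parallel",
        "partitionsSpec": {
          "type": "hashed",
          "partitionDimensions": ["page"]
        }
      }
      ```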

##########
File path: docs/querying/multi-value-dimensions.md
##########
@@ -1,4 +1,4 @@
----
+  ---

Review comment:
       ```suggestion
   ---
   ```

##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. 
Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can 
have substantial impact on footprint and performance.

Review comment:
       phrasing seems a bit strange; another option could be "partitioning of 
and sorting segments" though not much better

##########
File path: docs/ingestion/rollup.md
##########
@@ -0,0 +1,61 @@
+---
+id: rollup
+title: "Data rollup"
+sidebar_label: Data rollup
+description: Introduces rollup as a concept. Provides suggestions to maximize 
the benefits of rollup. Differentiates between perfect and best-effort rollup.
+---
+Druid can roll up data at ingestion time to reduce the amount of raw data to  
store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up 
data can dramatically reduce the size of data to be stored and reduce row 
counts by potentially orders of magnitude. As a trade off for the efficiency of 
rollup, you lose the ability to query individual events.

Review comment:
       ```suggestion
   Druid can roll up data at ingestion time to reduce the amount of raw data to 
 store on disk. Rollup is a form of summarization or pre-aggregation. Rolling 
up data can dramatically reduce the size of data to be stored and reduce row 
counts by potentially orders of magnitude. As a trade-off for the efficiency of 
rollup, you lose the ability to query individual events.
   ```

##########
File path: docs/ingestion/data-model.md
##########
@@ -0,0 +1,38 @@
+---
+id: data-model
+title: "Druid data model"
+sidebar_label: Data model
+description: Introduces concepts of datasources, primary timestamp, 
dimensions, and metrics.
+---
+
+Druid stores data in datasources, which are similar to tables in a traditional 
relational database management systems (RDBMS). Druid's data model shares  
similarities with both relational and timeseries data models.
+
+## Primary timestamp
+
+Druid schemas must always include a primary timestamp. Druid uses the primary 
timestamp to [partition and sort](./partitioning.md) your data. Druid uses the 
primary timestamp to rapidly identify and retrieve data within the time range 
of queries. Druid also uses the primary timestamp column
+for time-based [data management operations](./data-management.md) such as 
dropping time chunks, overwriting time chunks, and time-based retention rules.
+
+Druid parses the primary timestamp based on the 
[`timestampSpec`](./ingestion-spec.md#timestampspec) configuration at ingestion 
time. You can control other important operations that are based on the primary 
timestamp
+[`granularitySpec`](./ingestion-spec.md#granularityspec). Regardless of the 
source input field for the primary timestamp, Druid always stores the timestamp 
in the `__time` column in your Druid datasource.
+
+If you have more than one timestamp column, you can store the others as
+[secondary timestamps](./schema-design.md#secondary-timestamps).
+
+## Dimensions
+
+Dimensions are columns that Druid stores "as-is". You can use dimensions for 
any purpose. For example, you can group, filter, or apply aggregators to 
dimensions at query time in an ad-hoc manner.
+
+If you disable [rollup](./rollup.md), then Druid treats the set of
+dimensions like a set of columns to ingest. The dimensions behave exactly as 
you would expect from any database that does not support a rollup feature.
+
+At ingestion time, you configure dimensions in the 
[`dimensionsSpec`](./ingestion-spec.md#dimensionsspec).
+
+## Metrics
+
+Metrics are columns that Druid stores in an aggregated form. Metrics are most 
useful when you enable [rollup](rollup.md). If you Specify a metric, you can 
apply an aggregation function to each row during ingestion. This

Review comment:
       ```suggestion
   Metrics are columns that Druid stores in an aggregated form. Metrics are 
most useful when you enable [rollup](rollup.md). If you specify a metric, you 
can apply an aggregation function to each row during ingestion. This
   ```

##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. 
Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can 
have substantial impact on footprint and performance.
+
+One way to partition is to your load data into separate datasources. This is a 
perfectly viable approach that works very well when the number of datasources 
does not lead to excessive per-datasource overheads. 
+
+This topic describes how to set up partitions within a single datasource. It 
does not cover using multiple datasources. See [Multitenancy 
considerations](../querying/multitenancy.md) for more details on splitting data 
into separate datasources and potential operational considerations.
+
+## Time chunk partitioning
+
+Druid always partitions datasources by time into _time chunks_. Each time 
chunk contains one or more segments. This partitioning happens for all 
ingestion methods based on the `segmentGranularity` parameter in your ingestion 
spec `dataSchema` object.
+
+## Secondary partitioning
+
+Druid can partition segments within a particular time chunk further depending 
upon options that vary based on the ingestion type you have chosen. In general, 
secondary partitioning on a particular dimension improves locality. This means 
that rows with the same value for that dimension are stored together, 
decreasing access time.
+
+To achieve the best performance and smallest overall footprint, partition your 
data on a "natural"
+dimension that you often use as a filter when possible. Such partitioning 
often improves compression and query performance. For example, some cases have 
yielded threefold storage size decreases.
+
+## Partitioning and sorting
+
+Partitioning and sorting work well together. If you do have a "natural" 
partitioning dimension, consider placing it first in the `dimensions` list of 
your `dimensionsSpec`. This way Druid sorts rows within each segment by that 
column. This sorting configuration frequently improves compression more than 
using partitioning alone.
+
+> Note that Druid always sorts rows within a segment by timestamp first, even 
before the first dimension listed in your `dimensionsSpec`. This sorting can 
preclude the efficacy of dimension sorting. To work around this limitation if 
necessary, set your `queryGranularity` equal to `segmentGranularity` in your 
[`granularitySpec`](./ingestion-spec.md#granularityspec). Druid will set all 
timestamps within the segment to the same value, and letting you identify a 
[secondary timestamp](schema-design.md#secondary-timestamps) as the "real" 
timestamp.
+
+## How to configure partitioning
+
+Not all ingestion methods support an explicit partitioning configuration, and 
not all have equivalent levels of flexibility. If you are doing initial 
ingestion through a less-flexible method like
+Kafka), you can use [reindexing](data-management.md#reingesting-data) or 
[compaction](compaction.md) to repartition your data after initial ingestion. 
This is a powerful technique you can use to optimally partition any data older 
than a certain even while you continuously add new data from a stream.

Review comment:
       "older than a certain **event**" or "older than a certain **X**, even 
while" ?

##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. 
Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can 
have substantial impact on footprint and performance.
+
+One way to partition is to your load data into separate datasources. This is a 
perfectly viable approach that works very well when the number of datasources 
does not lead to excessive per-datasource overheads. 
+
+This topic describes how to set up partitions within a single datasource. It 
does not cover using multiple datasources. See [Multitenancy 
considerations](../querying/multitenancy.md) for more details on splitting data 
into separate datasources and potential operational considerations.
+
+## Time chunk partitioning
+
+Druid always partitions datasources by time into _time chunks_. Each time 
chunk contains one or more segments. This partitioning happens for all 
ingestion methods based on the `segmentGranularity` parameter in your ingestion 
spec `dataSchema` object.
+
+## Secondary partitioning
+
+Druid can partition segments within a particular time chunk further depending 
upon options that vary based on the ingestion type you have chosen. In general, 
secondary partitioning on a particular dimension improves locality. This means 
that rows with the same value for that dimension are stored together, 
decreasing access time.
+
+To achieve the best performance and smallest overall footprint, partition your 
data on a "natural"
+dimension that you often use as a filter when possible. Such partitioning 
often improves compression and query performance. For example, some cases have 
yielded threefold storage size decreases.
+
+## Partitioning and sorting
+
+Partitioning and sorting work well together. If you do have a "natural" 
partitioning dimension, consider placing it first in the `dimensions` list of 
your `dimensionsSpec`. This way Druid sorts rows within each segment by that 
column. This sorting configuration frequently improves compression more than 
using partitioning alone.
+
+> Note that Druid always sorts rows within a segment by timestamp first, even 
before the first dimension listed in your `dimensionsSpec`. This sorting can 
preclude the efficacy of dimension sorting. To work around this limitation if 
necessary, set your `queryGranularity` equal to `segmentGranularity` in your 
[`granularitySpec`](./ingestion-spec.md#granularityspec). Druid will set all 
timestamps within the segment to the same value, and letting you identify a 
[secondary timestamp](schema-design.md#secondary-timestamps) as the "real" 
timestamp.

Review comment:
       ```suggestion
   > Note that Druid always sorts rows within a segment by timestamp first, 
even before the first dimension listed in your `dimensionsSpec`. This sorting 
can preclude the efficacy of dimension sorting. To work around this limitation 
if necessary, set your `queryGranularity` equal to `segmentGranularity` in your 
[`granularitySpec`](./ingestion-spec.md#granularityspec). Druid will set all 
timestamps within the segment to the same value, letting you identify a 
[secondary timestamp](schema-design.md#secondary-timestamps) as the "real" 
timestamp.
   ```
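
      For readers of the note above, the workaround amounts to setting both granularities to the same value in the ingestion spec's `granularitySpec`. A minimal illustrative fragment (values hypothetical):

      ```json
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "day",
        "rollup": true
      }
      ```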

##########
File path: docs/ingestion/data-model.md
##########
@@ -0,0 +1,38 @@
+---
+id: data-model
+title: "Druid data model"
+sidebar_label: Data model
+description: Introduces concepts of datasources, primary timestamp, 
dimensions, and metrics.
+---
+
+Druid stores data in datasources, which are similar to tables in a traditional 
relational database management systems (RDBMS). Druid's data model shares  
similarities with both relational and timeseries data models.
+
+## Primary timestamp
+
+Druid schemas must always include a primary timestamp. Druid uses the primary 
timestamp to [partition and sort](./partitioning.md) your data. Druid uses the 
primary timestamp to rapidly identify and retrieve data within the time range 
of queries. Druid also uses the primary timestamp column
+for time-based [data management operations](./data-management.md) such as 
dropping time chunks, overwriting time chunks, and time-based retention rules.
+
+Druid parses the primary timestamp based on the 
[`timestampSpec`](./ingestion-spec.md#timestampspec) configuration at ingestion 
time. You can control other important operations that are based on the primary 
timestamp
+[`granularitySpec`](./ingestion-spec.md#granularityspec). Regardless of the 
source input field for the primary timestamp, Druid always stores the timestamp 
in the `__time` column in your Druid datasource.
+
+If you have more than one timestamp column, you can store the others as
+[secondary timestamps](./schema-design.md#secondary-timestamps).
+
+## Dimensions
+
+Dimensions are columns that Druid stores "as-is". You can use dimensions for 
any purpose. For example, you can group, filter, or apply aggregators to 
dimensions at query time in an ad-hoc manner.

Review comment:
       ```suggestion
   Dimensions are columns that Druid stores "as-is". You can use dimensions for 
any purpose. For example, you can group, filter, or apply aggregators to 
dimensions at query time in an ad hoc manner.
   ```

##########
File path: docs/ingestion/rollup.md
##########
@@ -0,0 +1,61 @@
+---
+id: rollup
+title: "Data rollup"
+sidebar_label: Data rollup
+description: Introduces rollup as a concept. Provides suggestions to maximize 
the benefits of rollup. Differentiates between perfect and best-effort rollup.
+---
Druid can roll up data at ingestion time to reduce the amount of raw data to 
store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up 
data can dramatically reduce the size of data to be stored and reduce row 
counts by potentially orders of magnitude. As a trade-off for the efficiency of 
rollup, you lose the ability to query individual events.
+
+At ingestion time, you control rollup with the `rollup` setting in the 
[`granularitySpec`](./ingestion-spec.md#granularityspec). Rollup is enabled by 
default. This means Druid combines into a single row any rows that have 
identical [dimension](./data-model.md#dimensions) values and 
[timestamp](./data-model.md#primary-timestamp) values after 
[`queryGranularity`-based truncation](./ingestion-spec.md#granularityspec).
+
+When you disable rollup, Druid loads each row as-is without doing any form of 
pre-aggregation. This mode is similar to databases that do not support a rollup 
feature. Set `rollup` to `false` if you want Druid to store each record as-is, 
without any rollup summarization.
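As a sketch, a `granularitySpec` that enables rollup might look like the following; the specific granularity values are illustrative, not recommendations:

```json
"granularitySpec": {
  "segmentGranularity": "day",
  "queryGranularity": "minute",
  "rollup": true
}
```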
+
+## Maximizing rollup ratio
+
+To measure the rollup ratio of a datasource, compare the number of rows in 
Druid with the number of ingested events. The higher this result, the more 
benefit you gain from rollup. For example, you can run the following 
[Druid SQL](../querying/sql.md) query after ingestion:
+
+```sql
+SELECT SUM("cnt") / COUNT(*) * 1.0 FROM datasource
+```
+
+In this query, `cnt` refers to a "count" type metric from your ingestion spec. 
See
+[Counting the number of ingested events](schema-design.md#counting) on the 
"Schema design" page for more details about how counting works when rollup is 
enabled.
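To make the mechanics concrete, the following standalone Python sketch mimics rollup with a five-minute (`PT5M`) query granularity: rows with identical truncated timestamps and dimension values collapse into one row, and the rollup ratio is ingested events divided by stored rows. The event data and the `page` dimension are invented for illustration:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw events: (timestamp, page dimension, count metric).
events = [
    ("2024-01-01T00:01:10", "home", 1),
    ("2024-01-01T00:03:45", "home", 1),
    ("2024-01-01T00:04:20", "docs", 1),
    ("2024-01-01T00:07:05", "home", 1),
]

def truncate_to_5m(ts: str) -> str:
    """Truncate an ISO timestamp to a 5-minute boundary (PT5M queryGranularity)."""
    dt = datetime.fromisoformat(ts)
    dt = dt.replace(minute=dt.minute - dt.minute % 5, second=0)
    return dt.isoformat()

# Rollup: combine rows with identical (truncated timestamp, dimensions),
# summing the "count" metric.
rolled = defaultdict(int)
for ts, page, cnt in events:
    rolled[(truncate_to_5m(ts), page)] += cnt

# Rollup ratio, analogous to SELECT SUM("cnt") / COUNT(*) * 1.0 FROM datasource:
# four ingested events stored as three rows gives a ratio of about 1.33.
ratio = len(events) / len(rolled)
```

A coarser `queryGranularity` makes more timestamps match after truncation, so more rows collapse and the ratio rises.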
+
+Tips for maximizing rollup:
+
+- Design your schema with fewer dimensions and lower cardinality dimensions to 
yield better rollup ratios.
+- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality 
dimensions, which decrease rollup ratios.
+- Adjust your `queryGranularity` at ingestion time to increase the chances 
that multiple rows in Druid have matching timestamps. For example, use five 
minute query granularity (`PT5M`) instead of one minute (`PT1M`).
+- You can optionally load the same data into more than one Druid datasource. 
For example:
+    - Create a "full" datasource that has rollup disabled, or enabled but 
with a minimal rollup ratio.
+    - Create a second "abbreviated" datasource with fewer dimensions and a 
higher rollup ratio.
+     When queries only involve dimensions in the "abbreviated" set, use the 
second datasource to reduce query times. Often, this method only requires a 
small increase in storage footprint because abbreviated datasources tend to be 
substantially smaller.
+- If you use a [best-effort rollup](#perfect-rollup-vs-best-effort-rollup) 
ingestion configuration that does not guarantee perfect rollup, try one of the 
following:
+    - Switch to a guaranteed perfect rollup option.
+    - [Reindex](data-management.md#reingesting-data) or 
[compact](compaction.md) your data in the background after initial ingestion.
+
+## Perfect rollup vs. best-effort rollup
+
+Depending on the ingestion method, Druid has the following rollup options:
+- Guaranteed _perfect rollup_: Druid perfectly aggregates input data at 
ingestion time.
+- _Best-effort rollup_: Druid may not perfectly aggregate input data. 
Therefore, multiple segments might contain rows with the same timestamp and 
dimension values.
+
+In general, ingestion methods that offer best-effort rollup do this for one of 
the following reasons:
+- The ingestion method parallelizes ingestion without the shuffling step 
required for perfect rollup.
+- The ingestion method uses _incremental publishing_ which means it finalizes 
and publishes segments before all data for a time chunk has been received,

Review comment:
       ```suggestion
   - The ingestion method uses _incremental publishing_ which means it 
finalizes and publishes segments before all data for a time chunk has been 
received.
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


