[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

GitBox Thu, 05 Aug 2021 14:09:39 -0700


techdocsmith commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683789237




##########
File path: docs/ingestion/index.md
##########
@@ -88,656 +79,5 @@ This table compares the three available options:
 | **External dependencies** | None. | Hadoop cluster (Druid submits Map/Reduce 
jobs). | None. |
 | **Input locations** | Any [`inputSource`](./native-batch.md#input-sources). 
| Any Hadoop FileSystem or Druid datasource. | Any 
[`inputSource`](./native-batch.md#input-sources). |
 | **File formats** | Any [`inputFormat`](./data-formats.md#input-format). | 
Any Hadoop InputFormat. | Any [`inputFormat`](./data-formats.md#input-format). |
-| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in 
the [`tuningConfig`](native-batch.md#tuningconfig).  | Always perfect. | 
Perfect if `forceGuaranteedRollup` = true in the 
[`tuningConfig`](native-batch.md#tuningconfig). |
-| **Partitioning options** | Dynamic, hash-based, and range-based partitioning 
methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) 
for details. | Hash-based or range-based partitioning via 
[`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based 
partitioning methods are available. See [Partitions 
Spec](./native-batch.md#partitionsspec-1) for details. |
-
-<a name="data-model"></a>
-
-## Druid's data model
-
-### Datasources
-
-Druid data is stored in datasources, which are similar to tables in a 
traditional RDBMS. Druid
-offers a unique data modeling system that bears similarity to both relational 
and timeseries models.
-
-### Primary timestamp
-
-Druid schemas must always include a primary timestamp. The primary timestamp 
is used for
-[partitioning and sorting](#partitioning) your data. Druid queries are able to 
rapidly identify and retrieve data
-corresponding to time ranges of the primary timestamp column. Druid is also 
able to use the primary timestamp column
-for time-based [data management operations](data-management.md) such as 
dropping time chunks, overwriting time chunks,
-and time-based retention rules.
-
-The primary timestamp is parsed based on the 
[`timestampSpec`](#timestampspec). In addition, the
-[`granularitySpec`](#granularityspec) controls other important operations that 
are based on the primary timestamp.
-Regardless of which input field the primary timestamp is read from, it will 
always be stored as a column named `__time`
-in your Druid datasource.
-
-If you have more than one timestamp column, you can store the others as
-[secondary timestamps](schema-design.md#secondary-timestamps).
-
-### Dimensions
-
-Dimensions are columns that are stored as-is and can be used for any purpose. 
You can group, filter, or apply
-aggregators to dimensions at query time in an ad-hoc manner. If you run with 
[rollup](#rollup) disabled, then the set of
-dimensions is simply treated like a set of columns to ingest, and behaves 
exactly as you would expect from a typical
-database that does not support a rollup feature.
-
-Dimensions are configured through the [`dimensionsSpec`](#dimensionsspec).
-
-### Metrics
-
-Metrics are columns that are stored in an aggregated form. They are most 
useful when [rollup](#rollup) is enabled.
-Specifying a metric allows you to choose an aggregation function for Druid to 
apply to each row during ingestion. This
-has two benefits:
-
-1. If [rollup](#rollup) is enabled, multiple rows can be collapsed into one 
row even while retaining summary
-information. In the [rollup tutorial](../tutorials/tutorial-rollup.md), this 
is used to collapse netflow data to a
-single row per `(minute, srcIP, dstIP)` tuple, while retaining aggregate 
information about total packet and byte counts.
-2. Some aggregators, especially approximate ones, can be computed faster at 
query time even on non-rolled-up data if
-they are partially computed at ingestion time.
-
-Metrics are configured through the [`metricsSpec`](#metricsspec).
-
-## Rollup
-
-### What is rollup?
-
-Druid can roll up data as it is ingested to minimize the amount of raw data 
that needs to be stored. Rollup is
-a form of summarization or pre-aggregation. In practice, rolling up data can 
dramatically reduce the size of data that
-needs to be stored, reducing row counts by potentially orders of magnitude. 
This storage reduction does come at a cost:
-as we roll up data, we lose the ability to query individual events.
-
-When rollup is disabled, Druid loads each row as-is without doing any form of 
pre-aggregation. This mode is similar
-to what you would expect from a typical database that does not support a 
rollup feature.
-
-When rollup is enabled, then any rows that have identical 
[dimensions](#dimensions) and [timestamp](#primary-timestamp)
-to each other (after [`queryGranularity`-based truncation](#granularityspec)) 
can be collapsed, or _rolled up_, into a
-single row in Druid.
-
-By default, rollup is enabled.
-
-### Enabling or disabling rollup
-
-Rollup is controlled by the `rollup` setting in the 
[`granularitySpec`](#granularityspec). By default, it is `true`
-(enabled). Set this to `false` if you want Druid to store each record as-is, 
without any rollup summarization.
-
-### Example of rollup
-
-For an example of how to configure rollup, and of how the feature will modify 
your data, check out the
-[rollup tutorial](../tutorials/tutorial-rollup.md).
-
-### Maximizing rollup ratio
-
-You can measure the rollup ratio of a datasource by comparing the number of 
rows in Druid (`COUNT`) with the number of ingested
-events.  One way to do this is with a
-[Druid SQL](../querying/sql.md) query such as the following, where "count" 
refers to a `count`-type metric generated at ingestion time:
-
-```sql
-SELECT SUM("count") / (COUNT(*) * 1.0)
-FROM datasource
-```
-
-The higher this number is, the more benefit you are gaining from rollup.
-
-> See [Counting the number of ingested events](schema-design.md#counting) on 
the "Schema design" page for more details about
-how counting works when rollup is enabled.
-
-Tips for maximizing rollup:
-
-- Generally, the fewer dimensions you have, and the lower the cardinality of 
your dimensions, the better rollup ratios
-you will achieve.
-- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality 
dimensions, which harm rollup ratios.
-- Adjusting `queryGranularity` at ingestion time (for example, using `PT5M` 
instead of `PT1M`) increases the
-likelihood of two rows in Druid having matching timestamps, and can improve 
your rollup ratios.
-- It can be beneficial to load the same data into more than one Druid 
datasource. Some users choose to create a "full"
-datasource that has rollup disabled (or enabled, but with a minimal rollup 
ratio) and an "abbreviated" datasource that
-has fewer dimensions and a higher rollup ratio. When queries only involve 
dimensions in the "abbreviated" set, using
-that datasource leads to much faster query times. This can often be done with 
just a small increase in storage
-footprint, since abbreviated datasources tend to be substantially smaller.
-- If you are using a [best-effort 
rollup](#perfect-rollup-vs-best-effort-rollup) ingestion configuration that 
does not guarantee perfect
-rollup, you can potentially improve your rollup ratio by switching to a 
guaranteed perfect rollup option, or by
-[reindexing](data-management.md#reingesting-data) or 
[compacting](compaction.md) your data in the background after initial ingestion.
-
-### Perfect rollup vs Best-effort rollup
-
-Some Druid ingestion methods guarantee _perfect rollup_, meaning that input 
data are perfectly aggregated at ingestion
-time. Others offer _best-effort rollup_, meaning that input data might not be 
perfectly aggregated and thus there could
-be multiple segments holding rows with the same timestamp and dimension values.
-
-In general, ingestion methods that offer best-effort rollup do this because 
they are either parallelizing ingestion
-without a shuffling step (which would be required for perfect rollup), or 
because they are finalizing and publishing
-segments before all data for a time chunk has been received, which we call 
_incremental publishing_. In both of these
-cases, records that could theoretically be rolled up may end up in different 
segments. All types of streaming ingestion
-run in this mode.
-
-Ingestion methods that guarantee perfect rollup do it with an additional 
preprocessing step to determine intervals
-and partitioning before the actual data ingestion stage. This preprocessing 
step scans the entire input dataset, which
-generally increases the time required for ingestion, but provides information 
necessary for perfect rollup.
-
-The following table shows how each method handles rollup:
-
-|Method|How it works|
-|------|------------|
-|[Native batch](native-batch.md)|`index_parallel` and `index` type may be 
either perfect or best-effort, based on configuration.|
-|[Hadoop](hadoop.md)|Always perfect.|
-|[Kafka indexing 
service](../development/extensions-core/kafka-ingestion.md)|Always best-effort.|
-|[Kinesis indexing 
service](../development/extensions-core/kinesis-ingestion.md)|Always 
best-effort.|
-
-## Partitioning
-
-### Why partition?
-
-Optimal partitioning and sorting of segments within your datasources can have 
substantial impact on footprint and
-performance.
-
-Druid datasources are always partitioned by time into _time chunks_, and each 
time chunk contains one or more segments.
-This partitioning happens for all ingestion methods, and is based on the 
`segmentGranularity` parameter of your
-ingestion spec's `dataSchema`.
-
-The segments within a particular time chunk may also be partitioned further, 
using options that vary based on the
-ingestion type you have chosen. In general, doing this secondary partitioning 
using a particular dimension will
-improve locality, meaning that rows with the same value for that dimension are 
stored together and can be accessed
-quickly.
-
-You will usually get the best performance and smallest overall footprint by 
partitioning your data on some "natural"
-dimension that you often filter by, if one exists. This will often improve 
compression - users have reported threefold
-storage size decreases - and it also tends to improve query performance as 
well.
-
-> Partitioning and sorting are best friends! If you do have a "natural" 
partitioning dimension, you should also consider
-> placing it first in the `dimensions` list of your `dimensionsSpec`, which 
tells Druid to sort rows within each segment
-> by that column. This will often improve compression even more, beyond the 
improvement gained by partitioning alone.
->
-> However, note that currently, Druid always sorts rows within a segment by 
timestamp first, even before the first
-> dimension listed in your `dimensionsSpec`. This can prevent dimension 
sorting from being maximally effective. If
-> necessary, you can work around this limitation by setting `queryGranularity` 
equal to `segmentGranularity` in your
-> [`granularitySpec`](#granularityspec), which will set all timestamps within 
the segment to the same value, and by saving
-> your "real" timestamp as a [secondary 
timestamp](schema-design.md#secondary-timestamps). This limitation may be 
removed
-> in a future version of Druid.
-
-### How to set up partitioning
-
-Not all ingestion methods support an explicit partitioning configuration, and 
not all have equivalent levels of
-flexibility. As of current Druid versions, If you are doing initial ingestion 
through a less-flexible method (like
-Kafka) then you can use [reindexing](data-management.md#reingesting-data) or 
[compaction](compaction.md) to repartition your data after it
-is initially ingested. This is a powerful technique: you can use it to ensure 
that any data older than a certain
-threshold is optimally partitioned, even as you continuously add new data from 
a stream.
-
-The following table shows how each ingestion method handles partitioning:
-
-|Method|How it works|
-|------|------------|
-|[Native batch](native-batch.md)|Configured using 
[`partitionsSpec`](native-batch.md#partitionsspec) inside the `tuningConfig`.|
-|[Hadoop](hadoop.md)|Configured using 
[`partitionsSpec`](hadoop.md#partitionsspec) inside the `tuningConfig`.|
-|[Kafka indexing 
service](../development/extensions-core/kafka-ingestion.md)|Partitioning in 
Druid is guided by how your Kafka topic is partitioned. You can also 
[reindex](data-management.md#reingesting-data) or [compact](compaction.md) to 
repartition after initial ingestion.|
-|[Kinesis indexing 
service](../development/extensions-core/kinesis-ingestion.md)|Partitioning in 
Druid is guided by how your Kinesis stream is sharded. You can also 
[reindex](data-management.md#reingesting-data) or [compact](compaction.md) to 
repartition after initial ingestion.|
-
-> Note that, of course, one way to partition data is to load it into separate 
datasources. This is a perfectly viable
-> approach and works very well when the number of datasources does not lead to 
excessive per-datasource overheads. If
-> you go with this approach, then you can ignore this section, since it is 
describing how to set up partitioning
-> _within a single datasource_.
->
-> For more details on splitting data up into separate datasources, and 
potential operational considerations, refer
-> to the [Multitenancy considerations](../querying/multitenancy.md) page.
-
-<a name="spec"></a>
-
-## Ingestion specs
-
-No matter what ingestion method you use, data is loaded into Druid using 
either one-time [tasks](tasks.md) or
-ongoing "supervisors" (which run and supervise a set of tasks over time). In 
any case, part of the task or supervisor
-definition is an _ingestion spec_.
-
-Ingestion specs consists of three main components:
-
-- [`dataSchema`](#dataschema), which configures the [datasource 
name](#datasource),
-   [primary timestamp](#timestampspec), [dimensions](#dimensionsspec), 
[metrics](#metricsspec), and [transforms and filters](#transformspec) (if 
needed).
-- [`ioConfig`](#ioconfig), which tells Druid how to connect to the source 
system and how to parse data. For more information, see the
-   documentation for each [ingestion method](#ingestion-methods).
-- [`tuningConfig`](#tuningconfig), which controls various tuning parameters 
specific to each
-  [ingestion method](#ingestion-methods).
-
-Example ingestion spec for task type `index_parallel` (native batch):
-
-```
-{
-  "type": "index_parallel",
-  "spec": {
-    "dataSchema": {
-      "dataSource": "wikipedia",
-      "timestampSpec": {
-        "column": "timestamp",
-        "format": "auto"
-      },
-      "dimensionsSpec": {
-        "dimensions": [
-          "page",
-          "language",
-          { "type": "long", "name": "userId" }
-        ]
-      },
-      "metricsSpec": [
-        { "type": "count", "name": "count" },
-        { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": 
"bytes_added" },
-        { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": 
"bytes_deleted" }
-      ],
-      "granularitySpec": {
-        "segmentGranularity": "day",
-        "queryGranularity": "none",
-        "intervals": [
-          "2013-08-31/2013-09-01"
-        ]
-      }
-    },
-    "ioConfig": {
-      "type": "index_parallel",
-      "inputSource": {
-        "type": "local",
-        "baseDir": "examples/indexing/",
-        "filter": "wikipedia_data.json"
-      },
-      "inputFormat": {
-        "type": "json",
-        "flattenSpec": {
-          "useFieldDiscovery": true,
-          "fields": [
-            { "type": "path", "name": "userId", "expr": "$.user.id" }
-          ]
-        }
-      }
-    },
-    "tuningConfig": {
-      "type": "index_parallel"
-    }
-  }
-}
-```
-
-The specific options supported by these sections will depend on the [ingestion 
method](#ingestion-methods) you have chosen.
-For more examples, refer to the documentation for each ingestion method.
-
-You can also load data visually, without the need to write an ingestion spec, 
using the "Load data" functionality
-available in Druid's [web console](../operations/druid-console.md). Druid's 
visual data loader supports
-[Kafka](../development/extensions-core/kafka-ingestion.md),
-[Kinesis](../development/extensions-core/kinesis-ingestion.md), and
-[native batch](native-batch.md) mode.
-
-## `dataSchema`
-
-> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported 
by all ingestion methods
-except for _Hadoop_ ingestion. See the [Legacy `dataSchema` 
spec](#legacy-dataschema-spec) for the old spec.
-
-The `dataSchema` is a holder for the following components:
-
-- [datasource name](#datasource), [primary timestamp](#timestampspec),
-  [dimensions](#dimensionsspec), [metrics](#metricsspec), and 
-  [transforms and filters](#transformspec) (if needed).
-
-An example `dataSchema` is:
-
-```
-"dataSchema": {
-  "dataSource": "wikipedia",
-  "timestampSpec": {
-    "column": "timestamp",
-    "format": "auto"
-  },
-  "dimensionsSpec": {
-    "dimensions": [
-      "page",
-      "language",
-      { "type": "long", "name": "userId" }
-    ]
-  },
-  "metricsSpec": [
-    { "type": "count", "name": "count" },
-    { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": 
"bytes_added" },
-    { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": 
"bytes_deleted" }
-  ],
-  "granularitySpec": {
-    "segmentGranularity": "day",
-    "queryGranularity": "none",
-    "intervals": [
-      "2013-08-31/2013-09-01"
-    ]
-  }
-}
-```
-
-### `dataSource`
-
-The `dataSource` is located in `dataSchema` → `dataSource` and is simply the 
name of the
-[datasource](../design/architecture.md#datasources-and-segments) that data 
will be written to. An example
-`dataSource` is:
-
-```
-"dataSource": "my-first-datasource"
-```
-
-### `timestampSpec`
-
-The `timestampSpec` is located in `dataSchema` → `timestampSpec` and is 
responsible for
-configuring the [primary timestamp](#primary-timestamp). An example 
`timestampSpec` is:
-
-```
-"timestampSpec": {
-  "column": "timestamp",
-  "format": "auto"
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion 
spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then 
[`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and 
[`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-A `timestampSpec` can have the following components:
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|column|Input row field to read the primary timestamp from.<br><br>Regardless 
of the name of this input field, the primary timestamp will always be stored as 
a column named `__time` in your Druid datasource.|timestamp|
-|format|Timestamp format. Options are: <ul><li>`iso`: ISO8601 with 'T' 
separator, like "2000-01-01T01:02:03.456"</li><li>`posix`: seconds since 
epoch</li><li>`millis`: milliseconds since epoch</li><li>`micro`: microseconds 
since epoch</li><li>`nano`: nanoseconds since epoch</li><li>`auto`: 
automatically detects ISO (either 'T' or space separator) or millis 
format</li><li>any [Joda DateTimeFormat 
string](http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html)</li></ul>|auto|
-|missingValue|Timestamp to use for input records that have a null or missing 
timestamp `column`. Should be in ISO8601 format, like 
`"2000-01-01T01:02:03.456"`, even if you have specified something else for 
`format`. Since Druid requires a primary timestamp, this setting can be useful 
for ingesting datasets that do not have any per-record timestamps at all. |none|
-
-### `dimensionsSpec`
-
-The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is 
responsible for
-configuring [dimensions](#dimensions). An example `dimensionsSpec` is:
-
-```
-"dimensionsSpec" : {
-  "dimensions": [
-    "page",
-    "language",
-    { "type": "long", "name": "userId" }
-  ],
-  "dimensionExclusions" : [],
-  "spatialDimensions" : []
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion 
spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then 
[`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and 
[`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-A `dimensionsSpec` can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| dimensions | A list of [dimension names or objects](#dimension-objects). 
Cannot have the same column in both `dimensions` and 
`dimensionExclusions`.<br><br>If this and `spatialDimensions` are both null or 
empty arrays, Druid will treat all non-timestamp, non-metric columns that do 
not appear in `dimensionExclusions` as String-typed dimension columns. See 
[inclusions and exclusions](#inclusions-and-exclusions) below for details. | 
`[]` |
-| dimensionExclusions | The names of dimensions to exclude from ingestion. 
Only names are supported here, not objects.<br><br>This list is only used if 
the `dimensions` and `spatialDimensions` lists are both null or empty arrays; 
otherwise it is ignored. See [inclusions and 
exclusions](#inclusions-and-exclusions) below for details. | `[]` |
-| spatialDimensions | An array of [spatial dimensions](../development/geo.md). 
| `[]` |
-
-#### Dimension objects
-
-Each dimension in the `dimensions` list can either be a name or an object. 
Providing a name is equivalent to providing
-a `string` type dimension object with the given name, e.g. `"page"` is 
equivalent to `{"name": "page", "type": "string"}`.
-
-Dimension objects can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| type | Either `string`, `long`, `float`, or `double`. | `string` |
-| name | The name of the dimension. This will be used as the field name to 
read from input records, as well as the column name stored in generated 
segments.<br><br>Note that you can use a [`transformSpec`](#transformspec) if 
you want to rename columns during ingestion time. | none (required) |
-| createBitmapIndex | For `string` typed dimensions, whether or not bitmap 
indexes should be created for the column in generated segments. Creating a 
bitmap index requires more storage, but speeds up certain kinds of filtering 
(especially equality and prefix filtering). Only supported for `string` typed 
dimensions. | `true` |
-| multiValueHandling | Specify the type of handling for [multi-value 
fields](../querying/multi-value-dimensions.md). Possible values are 
`sorted_array`, `sorted_set`, and `array`. `sorted_array` and `sorted_set` 
order the array upon ingestion. `sorted_set` removes duplicates. `array` 
ingests data as-is | `sorted_array` |
-
-#### Inclusions and exclusions
-
-Druid will interpret a `dimensionsSpec` in two possible ways: _normal_ or 
_schemaless_.
-
-Normal interpretation occurs when either `dimensions` or `spatialDimensions` 
is non-empty. In this case, the combination of the two lists will be taken as 
the set of dimensions to be ingested, and the list of `dimensionExclusions` 
will be ignored.
-
-Schemaless interpretation occurs when both `dimensions` and 
`spatialDimensions` are empty or null. In this case, the set of dimensions is 
determined in the following way:
-
-1. First, start from the set of all root-level fields from the input record, 
as determined by the [`inputFormat`](./data-formats.md). "Root-level" includes 
all fields at the top level of a data structure, but does not included fields 
nested within maps or lists. To extract these, you must use a 
[`flattenSpec`](./data-formats.md#flattenspec). All fields of non-nested data 
formats, such as CSV and delimited text, are considered root-level.
-2. If a [`flattenSpec`](./data-formats.md#flattenspec) is being used, the set 
of root-level fields includes any fields generated by the flattenSpec. The 
useFieldDiscovery parameter determines whether the original root-level fields 
will be retained or discarded.
-3. Any field listed in `dimensionExclusions` is excluded.
-4. The field listed as `column` in the [`timestampSpec`](#timestampspec) is 
excluded.
-5. Any field used as an input to an aggregator from the 
[metricsSpec](#metricsspec) is excluded.
-6. Any field with the same name as an aggregator from the 
[metricsSpec](#metricsspec) is excluded.
-7. All other fields are ingested as `string` typed dimensions with the 
[default settings](#dimension-objects).
-
-> Note: Fields generated by a [`transformSpec`](#transformspec) are not 
currently considered candidates for
-> schemaless dimension interpretation.
-
-### `metricsSpec`
-
-The `metricsSpec` is located in `dataSchema` → `metricsSpec` and is a list of 
[aggregators](../querying/aggregations.md)
-to apply at ingestion time. This is most useful when [rollup](#rollup) is 
enabled, since it's how you configure
-ingestion-time aggregation.
-
-An example `metricsSpec` is:
-
-```
-"metricsSpec": [
-  { "type": "count", "name": "count" },
-  { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" 
},
-  { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": 
"bytes_deleted" }
-]
-```
-
-> Generally, when [rollup](#rollup) is disabled, you should have an empty 
`metricsSpec` (because without rollup,
-> Druid does not do any ingestion-time aggregation, so there is little reason 
to include an ingestion-time aggregator). However,
-> in some cases, it can still make sense to define metrics: for example, if 
you want to create a complex column as a way of
-> pre-computing part of an [approximate 
aggregation](../querying/aggregations.md#approximate-aggregations), this can 
only
-> be done by defining a metric in a `metricsSpec`.
-
-### `granularitySpec`
-
-The `granularitySpec` is located in `dataSchema` → `granularitySpec` and is 
responsible for configuring
-the following operations:
-
-1. Partitioning a datasource into [time 
chunks](../design/architecture.md#datasources-and-segments) (via 
`segmentGranularity`).
-2. Truncating the timestamp, if desired (via `queryGranularity`).
-3. Specifying which time chunks of segments should be created, for batch 
ingestion (via `intervals`).
-4. Specifying whether ingestion-time [rollup](#rollup) should be used or not 
(via `rollup`).
-
-Other than `rollup`, these operations are all based on the [primary 
timestamp](#primary-timestamp).
-
-An example `granularitySpec` is:
-
-```
-"granularitySpec": {
-  "segmentGranularity": "day",
-  "queryGranularity": "none",
-  "intervals": [
-    "2013-08-31/2013-09-01"
-  ],
-  "rollup": true
-}
-```
-
-A `granularitySpec` can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| type | Either `uniform` or `arbitrary`. In most cases you want to use 
`uniform`.| `uniform` |
-| segmentGranularity | [Time 
chunking](../design/architecture.md#datasources-and-segments) granularity for 
this datasource. Multiple segments can be created per time chunk. For example, 
when set to `day`, the events of the same day fall into the same time chunk 
which can be optionally further partitioned into multiple segments based on 
other configurations and input size. Any 
[granularity](../querying/granularities.md) can be provided here. Note that all 
segments in the same time chunk should have the same segment 
granularity.<br><br>Ignored if `type` is set to `arbitrary`.| `day` |
-| queryGranularity | The resolution of timestamp storage within each segment. 
This must be equal to, or finer, than `segmentGranularity`. This will be the 
finest granularity that you can query at and still receive sensible results, 
but note that you can still query at anything coarser than this granularity. 
E.g., a value of `minute` will mean that records will be stored at minutely 
granularity, and can be sensibly queried at any multiple of minutes (including 
minutely, 5-minutely, hourly, etc).<br><br>Any 
[granularity](../querying/granularities.md) can be provided here. Use `none` to 
store timestamps as-is, without any truncation. Note that `rollup` will be 
applied if it is set even when the `queryGranularity` is set to `none`. | 
`none` |
-| rollup | Whether to use ingestion-time [rollup](#rollup) or not. Note that 
rollup is still effective even when `queryGranularity` is set to `none`. Your 
data will be rolled up if they have the exactly same timestamp. | `true` |
-| intervals | A list of intervals describing what time chunks of segments 
should be created. If `type` is set to `uniform`, this list will be broken up 
and rounded-off based on the `segmentGranularity`. If `type` is set to 
`arbitrary`, this list will be used as-is.<br><br>If `null` or not provided, 
batch ingestion tasks will generally determine which time chunks to output 
based on what timestamps are found in the input data.<br><br>If specified, 
batch ingestion tasks may be able to skip a determining-partitions phase, which 
can result in faster ingestion. Batch ingestion tasks may also be able to 
request all their locks up-front instead of one by one. Batch ingestion tasks 
will throw away any records with timestamps outside of the specified 
intervals.<br><br>Ignored for any form of streaming ingestion. | `null` |
-
-### `transformSpec`
-
-The `transformSpec` is located in `dataSchema` → `transformSpec` and is 
responsible for transforming and filtering
-records during ingestion time. It is optional. An example `transformSpec` is:
-
-```
-"transformSpec": {
-  "transforms": [
-    { "type": "expression", "name": "countryUpper", "expression": 
"upper(country)" }
-  ],
-  "filter": {
-    "type": "selector",
-    "dimension": "country",
-    "value": "San Serriffe"
-  }
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion 
spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then 
[`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and 
[`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-#### Transforms
-
-The `transforms` list allows you to specify a set of expressions to evaluate 
on top of input data. Each transform has a
-"name" which can be referred to by your `dimensionsSpec`, `metricsSpec`, etc.
-
-If a transform has the same name as a field in an input row, then it will 
shadow the original field. Transforms that
-shadow fields may still refer to the fields they shadow. This can be used to 
transform a field "in-place".
-
-Transforms do have some limitations. They can only refer to fields present in 
the actual input rows; in particular,
-they cannot refer to other transforms. And they cannot remove fields, only add 
them. However, they can shadow a field
-with another field containing all nulls, which will act similarly to removing 
the field.
-
-Transforms can refer to the [timestamp](#timestampspec) of an input row by 
referring to `__time` as part of the expression.
-They can also _replace_ the timestamp if you set their "name" to `__time`. In 
both cases, `__time` should be treated as
-a millisecond timestamp (number of milliseconds since Jan 1, 1970 at midnight 
UTC). Transforms are applied _after_ the
-`timestampSpec`.
-
-Druid currently includes one kind of built-in transform, the expression 
transform. It has the following syntax:
-
-```
-{
-  "type": "expression",
-  "name": "<output name>",
-  "expression": "<expr>"
-}
-```
-
-The `expression` is a [Druid query expression](../misc/math-expr.md).
-
-> Conceptually, after input data records are read, Druid applies ingestion 
spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then 
[`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and 
[`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-#### Filter
-
-The `filter` conditionally filters input rows during ingestion. Only rows that 
pass the filter will be
-ingested. Any of Druid's standard [query filters](../querying/filters.md) can 
be used. Note that within a
-`transformSpec`, the `transforms` are applied before the `filter`, so the 
filter can refer to a transform.
-
-### Legacy `dataSchema` spec
-
-> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported 
by all ingestion methods
-except for _Hadoop_ ingestion. See [`dataSchema`](#dataschema) for the new 
spec.
-
-The legacy `dataSchema` spec has below two more components in addition to the 
ones listed in the [`dataSchema`](#dataschema) section above.
-
-- [input row parser](#parser-deprecated), [flattening of nested 
data](#flattenspec) (if needed)
-
-#### `parser` (Deprecated)
-
-In legacy `dataSchema`, the `parser` is located in the `dataSchema` → `parser` 
and is responsible for configuring a wide variety of
-items related to parsing input records. The `parser` is deprecated and it is 
highly recommended to use `inputFormat` instead.
-For details about `inputFormat` and supported `parser` types, see the ["Data 
formats" page](data-formats.md).
-
-For details about major components of the `parseSpec`, refer to their 
subsections:
-
-- [`timestampSpec`](#timestampspec), responsible for configuring the [primary 
timestamp](#primary-timestamp).
-- [`dimensionsSpec`](#dimensionsspec), responsible for configuring 
[dimensions](#dimensions).
-- [`flattenSpec`](#flattenspec), responsible for flattening nested data 
formats.
-
-An example `parser` is:
-
-```
-"parser": {
-  "type": "string",
-  "parseSpec": {
-    "format": "json",
-    "flattenSpec": {
-      "useFieldDiscovery": true,
-      "fields": [
-        { "type": "path", "name": "userId", "expr": "$.user.id" }
-      ]
-    },
-    "timestampSpec": {
-      "column": "timestamp",
-      "format": "auto"
-    },
-    "dimensionsSpec": {
-      "dimensions": [
-        "page",
-        "language",
-        { "type": "long", "name": "userId" }
-      ]
-    }
-  }
-}
-```
-
-#### `flattenSpec`
-
-In the legacy `dataSchema`, the `flattenSpec` is located in `dataSchema` → 
`parser` → `parseSpec` → `flattenSpec` and is responsible for
-bridging the gap between potentially nested input data (such as JSON, Avro, 
etc) and Druid's flat data model.
-See [Flatten spec](./data-formats.md#flattenspec) for more details.
-
-## `ioConfig`
-
-The `ioConfig` influences how data is read from a source system, such as 
Apache Kafka, Amazon S3, a mounted
-filesystem, or any other supported source system. The `inputFormat` property 
applies to all
-[ingestion method](#ingestion-methods) except for Hadoop ingestion. The Hadoop 
ingestion still
-uses the [`parser`](#parser-deprecated) in the legacy `dataSchema`.
-The rest of `ioConfig` is specific to each individual ingestion method.
-An example `ioConfig` to read JSON data is:
-
-```json
-"ioConfig": {
-    "type": "<ingestion-method-specific type code>",
-    "inputFormat": {
-      "type": "json"
-    },
-    ...
-}
-```
-For more details, see the documentation provided by each [ingestion 
method](#ingestion-methods).
-
-## `tuningConfig`
-
-Tuning properties are specified in a `tuningConfig`, which goes at the top 
level of an ingestion spec. Some
-properties apply to all [ingestion methods](#ingestion-methods), but most are 
specific to each individual
-ingestion method. An example `tuningConfig` that sets all of the shared, 
common properties to their defaults
-is:
-
-```plaintext
-"tuningConfig": {
-  "type": "<ingestion-method-specific type code>",
-  "maxRowsInMemory": 1000000,
-  "maxBytesInMemory": <one-sixth of JVM memory>,
-  "indexSpec": {
-    "bitmap": { "type": "roaring" },
-    "dimensionCompression": "lz4",
-    "metricCompression": "lz4",
-    "longEncoding": "longs"
-  },
-  <other ingestion-method-specific properties>
-}
-```
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Each ingestion method has its own tuning type code. You must specify the 
type code that matches your ingestion method. Common options are `index`, 
`hadoop`, `kafka`, and `kinesis`.||
-|maxRowsInMemory|The maximum number of records to store in memory before 
persisting to disk. Note that this is the number of rows post-rollup, and so it 
may not be equal to the number of input records. Ingested records will be 
persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` are 
reached (whichever happens first).|`1000000`|
-|maxBytesInMemory|The maximum aggregate size of records, in bytes, to store in 
the JVM heap before persisting. This is based on a rough estimate of memory 
usage. Ingested records will be persisted to disk when either `maxRowsInMemory` 
or `maxBytesInMemory` are reached (whichever happens first). `maxBytesInMemory` 
also includes heap usage of artifacts created from intermediary persists. This 
means that after every persist, the amount of `maxBytesInMemory` until next 
persist will decreases, and task will fail when the sum of bytes of all 
intermediary persisted artifacts exceeds `maxBytesInMemory`.<br /><br />Setting 
maxBytesInMemory to -1 disables this check, meaning Druid will rely entirely on 
maxRowsInMemory to control memory usage. Setting it to zero means the default 
value will be used (one-sixth of JVM heap size).<br /><br />Note that the 
estimate of memory usage is designed to be an overestimate, and can be 
especially high when using complex ingest-time aggregators, including sk
 etches. If this causes your indexing workloads to persist to disk too often, 
you can set maxBytesInMemory to -1 and rely on maxRowsInMemory 
instead.|One-sixth of max JVM heap size|
-|skipBytesInMemoryOverheadCheck|The calculation of maxBytesInMemory takes into 
account overhead objects created during ingestion and each intermediate 
persist. Setting this to true can exclude the bytes of these overhead objects 
from maxBytesInMemory check.|false|
-|indexSpec|Tune how data is indexed. See below for more information.|See table 
below|
-|Other properties|Each ingestion method has its own list of additional tuning 
properties. See the documentation for each method for a full list: [Kafka 
indexing 
service](../development/extensions-core/kafka-ingestion.md#tuningconfig), 
[Kinesis indexing 
service](../development/extensions-core/kinesis-ingestion.md#tuningconfig), 
[Native batch](native-batch.md#tuningconfig), and 
[Hadoop-based](hadoop.md#tuningconfig).||
-
-#### `indexSpec`
-
-The `indexSpec` object can include the following properties:
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|bitmap|Compression format for bitmap indexes. Should be a JSON object with 
`type` set to `roaring` or `concise`. For type `roaring`, the boolean property 
`compressRunOnSerialization` (defaults to true) controls whether or not 
run-length encoding will be used when it is determined to be more 
space-efficient.|`{"type": "concise"}`|
-|dimensionCompression|Compression format for dimension columns. Options are 
`lz4`, `lzf`, or `uncompressed`.|`lz4`|
-|metricCompression|Compression format for primitive type metric columns. 
Options are `lz4`, `lzf`, `uncompressed`, or `none` (which is more efficient 
than `uncompressed`, but not supported by older versions of Druid).|`lz4`|
-|longEncoding|Encoding format for long-typed columns. Applies regardless of 
whether they are dimensions or metrics. Options are `auto` or `longs`. `auto` 
encodes the values using offset or lookup table depending on column 
cardinality, and store them with variable size. `longs` stores the value as-is 
with 8 bytes each.|`longs`|
-
-Beyond these properties, each ingestion method has its own specific tuning 
properties. See the documentation for each
-[ingestion method](#ingestion-methods) for details.
+| **[Rollup modes](rollup.md)** | Perfect if `forceGuaranteedRollup` = true in 
the [`tuningConfig`](native-batch.md#tuningconfig).  | Always perfect. | 
Perfect if `forceGuaranteedRollup` = true in the 
[`tuningConfig`](native-batch.md#tuningconfig). |
+| **Partitioning options** | Dynamic, hash-based, and range-based partitioning 
methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) 
for details.| Hash-based or range-based partitioning via 
[`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based 
partitioning methods are available. See [Partitions 
Spec](./native-batch.md#partitionsspec-1) for details. |

Review comment:
       Normally I'd prefer the English "Partitions spec" to the code, but the 
header is `partitionsSpec` so went with that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

Reply via email to