ccaominh commented on a change in pull request #8311: Docusaurus build 
framework + ingestion doc refresh.
URL: https://github.com/apache/incubator-druid/pull/8311#discussion_r314434911
 
 

 ##########
 File path: docs/ingestion/index.md
 ##########
 @@ -0,0 +1,769 @@
+---
+id: index
+title: "Ingestion"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Overview
+
+All data in Druid is organized into _segments_, which are data files that 
generally have up to a few million rows each.
+Loading data in Druid is called _ingestion_ or _indexing_ and consists of 
reading data from a source system and creating
+segments based on that data.
+
+In most ingestion methods, the work of loading data is done by Druid 
MiddleManager processes. One exception is
+Hadoop-based ingestion, where this work is instead done using a Hadoop 
MapReduce job on YARN (although MiddleManager
+processes are still involved in starting and monitoring the Hadoop jobs). Once 
segments have been generated and stored
+in [deep storage](../dependencies/deep-storage.md), they will be loaded by 
Historical processes. For more details on
+how this works under the hood, see the [Storage 
design](../design/architecture.md#storage-design) section of Druid's design
+documentation.
+
+## How to use this documentation
+
+This **page you are currently reading** provides information about universal 
Druid ingestion concepts, and about
+configurations that are common to all [ingestion methods](#ingestion-methods).
+
+The **individual pages for each ingestion method** provide additional 
information about concepts and configurations
+that are unique to each ingestion method.
+
+We recommend reading (or at least skimming) this universal page first, and 
then referring to the page for the
+ingestion method or methods that you have chosen.
+
+## Ingestion methods
+
+The tables below list Druid's most common data ingestion methods, along with comparisons to help you choose
+the best one for your situation. Each ingestion method supports its own set of 
source systems to pull from. For details
+about how each method works, as well as configuration properties specific to 
that method, check out its documentation
+page.
+
+### Streaming
+
+The recommended, and most popular, method of streaming ingestion is the
+[Kafka indexing service](../development/extensions-core/kafka-ingestion.md), which reads directly from Kafka. The
+[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md) also works well if you prefer Kinesis.
+
+This table compares the major available options:
+
+| **Method** | [Kafka](../development/extensions-core/kafka-ingestion.md) | 
[Kinesis](../development/extensions-core/kinesis-ingestion.md) | 
[Tranquility](tranquility.md) |
+|---|-----|--------------|------------|
+| **Supervisor type** | `kafka` | `kinesis` | N/A |
+| **How it works** | Druid reads directly from Apache Kafka. | Druid reads 
directly from Amazon Kinesis. | Tranquility, a library that ships separately 
from Druid, is used to push data into Druid. |
+| **Can ingest late data?** | Yes | Yes | No (late data is dropped based on 
the `windowPeriod` config) |
+| **Exactly-once guarantees?** | Yes | Yes | No |
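+
+As a brief illustration, the source-specific part of a Kafka supervisor spec is its `ioConfig`. A minimal sketch might
+look like the following (the topic name and broker address are placeholders; a complete supervisor spec also includes a
+`dataSchema` and, optionally, a `tuningConfig`, as described on the
+[Kafka indexing service](../development/extensions-core/kafka-ingestion.md) page):
+
+```json
+"ioConfig": {
+  "topic": "my_topic",
+  "consumerProperties": { "bootstrap.servers": "kafka-broker-1:9092" }
+}
+```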
+
+### Batch
+
+When doing batch loads from files, you should use one-time [tasks](tasks.md), and you have three options: `index`
+(native batch; single-task), `index_parallel` (native batch; parallel), or `index_hadoop` (Hadoop-based).
+
+In general, we recommend native batch whenever it meets your needs, since the 
setup is simpler (it does not depend on
+an external Hadoop cluster). However, there are still scenarios where 
Hadoop-based batch ingestion is the right choice,
+especially due to its support for custom partitioning options and reading 
binary data formats.
+
+This table compares the major available options:
+
+| **Method** | [Native batch (simple)](native-batch.md#simple-task) | [Native batch (parallel)](native-batch.md#parallel-task) | [Hadoop-based](hadoop.md) |
+|---|-----|--------------|------------|
+| **Task type** | `index` | `index_parallel` | `index_hadoop` |
+| **Automatically parallel?** | No. Each task is single-threaded. | Yes, if the firehose is splittable. See the [firehose documentation](native-batch.md#firehoses) for details. | Yes, always. |
+| **Can append or overwrite?** | Yes, both. | Yes, both. | Overwrite only. |
+| **External dependencies** | None. | None. | Hadoop cluster (Druid submits 
Map/Reduce jobs). |
+| **Input locations** | Any [firehose](native-batch.md#firehoses). | Any 
[firehose](native-batch.md#firehoses). | Any Hadoop FileSystem or Druid 
datasource. |
+| **File formats** | Text file formats (CSV, TSV, JSON). Support for binary 
formats is coming in a future release. | Text file formats (CSV, TSV, JSON). 
Support for binary formats is coming in a future release. | Any Hadoop 
InputFormat. |
+| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). | Only best-effort. Support for perfect rollup is coming in a future release. | Always perfect. |
+| **Partitioning options** | Hash-based partitioning is supported when `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). | None. Support for partitioning is coming in a future release. | Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitioning-specification). |
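+
+As an illustration, the input side of a native batch task spec is its `ioConfig`, which names a
+[firehose](native-batch.md#firehoses) to read from. A minimal sketch for the parallel task might look like the
+following (the directory and file filter are placeholders; a complete task spec also includes a `dataSchema` and,
+optionally, a `tuningConfig`, as described on the [native batch](native-batch.md) page):
+
+```json
+"ioConfig": {
+  "type": "index_parallel",
+  "firehose": {
+    "type": "local",
+    "baseDir": "/path/to/data",
+    "filter": "*.json"
+  }
+}
+```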
+
+## Druid's data model
+
+### Datasources
+
+Druid data is stored in _datasources_, which are similar to tables in a traditional RDBMS. Druid
+offers a unique data modeling system that bears similarity to both relational 
and timeseries models.
+
+### Primary timestamp
+
+Druid schemas must always include a primary timestamp. The primary timestamp 
is used for
+[partitioning and sorting](#partitioning) your data. Druid queries are able to 
rapidly identify and retrieve data
+corresponding to time ranges of the primary timestamp column. Druid is also 
able to use the primary timestamp column
+for time-based [data management operations](data-management.md) such as 
dropping time chunks, overwriting time chunks,
+and time-based retention rules.
+
+The primary timestamp is parsed based on the 
[`timestampSpec`](#timestampspec). In addition, the
+[`granularitySpec`](#granularityspec) controls other important operations that 
are based on the primary timestamp.
+Regardless of which input field the primary timestamp is read from, it will 
always be stored as a column named `__time`
+in your Druid datasource.
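+
+For example, a `timestampSpec` that reads the primary timestamp from an input field named `timestamp` (a placeholder
+name; use whatever field your data actually contains) and auto-detects its format might look like this:
+
+```json
+"timestampSpec": {
+  "column": "timestamp",
+  "format": "auto"
+}
+```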
+
+If you have more than one timestamp column, you can store the others as
+[secondary timestamps](schema-design.md#secondary-timestamps).
+
+### Dimensions
+
+Dimensions are columns that are stored as-is and can be used for any purpose. 
You can group, filter, or apply
+aggregators to dimensions at query time in an ad-hoc manner. If you run with 
[rollup](#rollup) disabled, then the set of
+dimensions is simply treated like a set of columns to ingest, and behaves 
exactly as you would expect from a typical
+database that does not support a rollup feature.
+
+Dimensions are configured through the [`dimensionsSpec`](#dimensionsspec).
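+
+For example, a `dimensionsSpec` that ingests two string dimensions and one long-typed dimension might look like the
+following (the column names are placeholders borrowed from the netflow example used elsewhere on this page):
+
+```json
+"dimensionsSpec": {
+  "dimensions": [
+    "srcIP",
+    "dstIP",
+    { "type": "long", "name": "protocol" }
+  ]
+}
+```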
+
+### Metrics
+
+Metrics are columns that are stored in an aggregated form. They are most 
useful when [rollup](#rollup) is enabled.
+Specifying a metric allows you to choose an aggregation function for Druid to 
apply to each row during ingestion. This
+has two benefits:
+
+1. If [rollup](#rollup) is enabled, multiple rows can be collapsed into one 
row even while retaining summary
+information. In the [rollup tutorial](../tutorials/tutorial-rollup.md), this 
is used to collapse netflow data to a
+single row per `(minute, srcIP, dstIP)` tuple, while retaining aggregate 
information about total packet and byte counts.
+2. Some aggregators, especially approximate ones, can be computed faster at 
query time even on non-rolled-up data if
+they are partially computed at ingestion time.
+
+Metrics are configured through the [`metricsSpec`](#metricsspec).
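+
+For example, a `metricsSpec` matching the netflow scenario above might count ingested rows and sum the packet and byte
+columns (the field names are placeholders):
+
+```json
+"metricsSpec": [
+  { "type": "count", "name": "count" },
+  { "type": "longSum", "name": "packets", "fieldName": "packets" },
+  { "type": "longSum", "name": "bytes", "fieldName": "bytes" }
+]
+```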
+
+## Rollup
+
+### What is rollup?
+
+Druid can roll up data as it is ingested to minimize the amount of raw data 
that needs to be stored. Rollup is
+a form of summarization or pre-aggregation. In practice, rolling up data can 
dramatically reduce the size of data that
+needs to be stored, reducing row counts by potentially orders of magnitude. 
This storage reduction does come at a cost:
+as we roll up data, we lose the ability to query individual events.
+
+When rollup is disabled, Druid loads each row as-is without doing any form of 
pre-aggregation. This mode is similar
+to what you would expect from a typical database that does not support a 
rollup feature.
+
+When rollup is enabled, then any rows that have identical 
[dimensions](#dimensions) and [timestamp](#primary-timestamp)
+to each other (after [`queryGranularity`-based truncation](#granularityspec)) 
can be collapsed, or _rolled up_, into a
+single row in Druid.
+
+By default, rollup is enabled.
+
+### Enabling or disabling rollup
+
+Rollup is controlled by the `rollup` setting in the 
[`granularitySpec`](#granularityspec). By default, it is `true`
+(enabled). Set this to `false` if you want Druid to store each record as-is, 
without any rollup summarization.
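+
+For example, a `granularitySpec` with rollup enabled might look like the following (the segment and query granularities
+shown here are only illustrative values):
+
+```json
+"granularitySpec": {
+  "type": "uniform",
+  "segmentGranularity": "DAY",
+  "queryGranularity": "MINUTE",
+  "rollup": true
+}
+```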
+
+### Example of rollup
+
+For an example of how to configure rollup, and of how the feature will modify your data, check out the
+[rollup tutorial](../tutorials/tutorial-rollup.md).
+
+### Maximizing rollup ratio
+
+You can measure the rollup ratio of a datasource by comparing the number of 
rows in Druid with the number of ingested
+events. The higher this number, the more benefit you are gaining from rollup. 
One way to do this is with a
+[Druid SQL](../querying/sql.md) query like:
+
+```sql
+SELECT SUM("cnt") / COUNT(*) * 1.0 FROM datasource
+```
+
+In this query, `cnt` should refer to a "count" type metric specified at 
ingestion time. See
+[Counting the number of ingested events](schema-design.md#counting) on the 
"Schema design" page for more details about
+how counting works when rollup is enabled.
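+
+For instance, the `cnt` column referenced above would be created at ingestion time by including a count aggregator in
+the [`metricsSpec`](#metricsspec):
+
+```json
+{ "type": "count", "name": "cnt" }
+```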
+
+Tips for maximizing rollup:
+
+- Generally, the fewer dimensions you have, and the lower the cardinality of 
your dimensions, the better rollup ratios
+you will achieve.
+- Use [sketches](#sketches) to avoid storing high cardinality dimensions, which harm rollup ratios. (An example sketch
+metric appears after this list.)
+- Adjusting `queryGranularity` at ingestion time (for example, using `PT5M` 
instead of `PT1M`) increases the
+likelihood of two rows in Druid having matching timestamps, and can improve 
your rollup ratios.
+- It can be beneficial to load the same data into more than one Druid 
datasource. Some users choose to create a "full"
+datasource that has rollup disabled (or enabled, but with a minimal rollup 
ratio) and an "abbreviated" datasource that
+has fewer dimensions and a higher rollup ratio. When queries only involve 
dimensions in the "abbreviated" set, using
+that datasource leads to much faster query times. This can often be done with 
just a small increase in storage
+footprint, since abbreviated datasources tend to be substantially smaller.
+- If you are using a [best-effort rollup](#best-effort-rollup) ingestion 
configuration that does not guarantee perfect
+rollup, you can potentially improve your rollup ratio by switching to a 
guaranteed perfect rollup option, or by
+[reindexing](data-management.md#compaction-and-reindexing) your data in the 
background after initial ingestion.
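+
+As an example of the sketch tip above, instead of ingesting a high cardinality `user_id` dimension, you could add a
+metric such as the theta sketch aggregator from the `druid-datasketches` extension (the column names here are
+placeholders):
+
+```json
+{ "type": "thetaSketch", "name": "unique_users", "fieldName": "user_id" }
+```
+
+This keeps approximate distinct-count information available at query time while removing the high cardinality column
+from the dimension set, which helps rows roll up together.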
+
+### Best-effort rollup
+
+Some Druid ingestion methods guarantee _perfect rollup_, meaning that input 
data are perfectly aggregated at ingestion
+time. Others offer _best-effort rollup_, meaning that input data might not be 
perfectly aggregated and thus there could
+be multiple segments holding rows with the same timestamp and dimension values.
+
+In general, ingestion methods that offer best-effort rollup do this because 
they are either parallelizing ingestion
+without a shuffling step (which would be required for perfect rollup), or 
because they are finalizing and publishing
+segments before all data for a time chunk has been received, which we call 
_incremental publishing_. In both of these
+cases, records may end up in different segments that are received by 
different, non-shuffling tasks cannot be rolled
 
 Review comment:
   This sentence needs to be reworded. Perhaps something like: non-shuffling 
tasks cannot be -> non-shuffling tasks and cannot be

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
