jihoonson commented on a change in pull request #8311: Docusaurus build framework + ingestion doc refresh.
URL: https://github.com/apache/incubator-druid/pull/8311#discussion_r315436827
##########
File path: docs/ingestion/index.md
##########
@@ -0,0 +1,768 @@
+---
+id: index
+title: "Ingestion"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements. See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership. The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License. You may obtain a copy of the License at
+  ~
+  ~ http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied. See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Overview
+
+All data in Druid is organized into _segments_, which are data files that generally have up to a few million rows each.
+Loading data in Druid is called _ingestion_ or _indexing_ and consists of reading data from a source system and creating
+segments based on that data.
+
+In most ingestion methods, the work of loading data is done by Druid MiddleManager processes. One exception is
+Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager
+processes are still involved in starting and monitoring the Hadoop jobs). Once segments have been generated and stored
+in [deep storage](../dependencies/deep-storage.md), they will be loaded by Historical processes. For more details on
+how this works under the hood, see the [Storage design](../design/architecture.md#storage-design) section of Druid's design
+documentation.
+
+## How to use this documentation
+
+This **page you are currently reading** provides information about universal Druid ingestion concepts, and about
+configurations that are common to all [ingestion methods](#ingestion-methods).
+
+The **individual pages for each ingestion method** provide additional information about concepts and configurations
+that are unique to each ingestion method.
+
+We recommend reading (or at least skimming) this universal page first, and then referring to the page for the
+ingestion method or methods that you have chosen.
+
+## Ingestion methods
+
+The table below lists Druid's most common data ingestion methods, along with comparisons to help you choose
+the best one for your situation. Each ingestion method supports its own set of source systems to pull from. For details
+about how each method works, as well as configuration properties specific to that method, check out its documentation
+page.
+
+### Streaming
+
+The most recommended, and most popular, method of streaming ingestion is the
+[Kafka indexing service](../development/extensions-core/kafka-ingestion.md) that reads directly from Kafka. The Kinesis
+indexing service also works well if you prefer Kinesis.
+
+This table compares the major available options:
+
+| **Method** | [Kafka](../development/extensions-core/kafka-ingestion.md) | [Kinesis](../development/extensions-core/kinesis-ingestion.md) | [Tranquility](tranquility.md) |
+|---|-----|--------------|------------|
+| **Supervisor type** | `kafka` | `kinesis` | N/A |
+| **How it works** | Druid reads directly from Apache Kafka. | Druid reads directly from Amazon Kinesis. | Tranquility, a library that ships separately from Druid, is used to push data into Druid. |
+| **Can ingest late data?** | Yes | Yes | No (late data is dropped based on the `windowPeriod` config) |
+| **Exactly-once guarantees?** | Yes | Yes | No |
+
+### Batch
+
+When doing batch loads from files, you should use one-time [tasks](tasks.md), and you have three options: `index`
+(native batch; single-task), `index_parallel` (native batch; parallel), or `index_hadoop` (Hadoop-based).
+
+In general, we recommend native batch whenever it meets your needs, since the setup is simpler (it does not depend on
+an external Hadoop cluster). However, there are still scenarios where Hadoop-based batch ingestion is the right choice,
+especially due to its support for custom partitioning options and reading binary data formats.
+
+This table compares the three available options:
+
+| **Method** | [Native batch (simple)](native-batch.html#simple-task) | [Native batch (parallel)](native-batch.html#parallel-task) | [Hadoop-based](hadoop.html) |
+|---|-----|--------------|------------|
+| **Task type** | `index` | `index_parallel` | `index_hadoop` |
+| **Parallel?** | No. Each task is single-threaded. | Yes, if firehose is splittable and `maxNumConcurrentSubTasks` > 1 in tuningConfig. See [firehose documentation](native-batch.md#firehoses) for details. | Yes, always. |
+| **Can append or overwrite?** | Yes, both. | Yes, both. | Overwrite only. |
+| **External dependencies** | None. | None. | Hadoop cluster (Druid submits Map/Reduce jobs). |
+| **Input locations** | Any [firehose](native-batch.md#firehoses). | Any [firehose](native-batch.md#firehoses). | Any Hadoop FileSystem or Druid datasource. |
+| **File formats** | Text file formats (CSV, TSV, JSON). Support for binary formats is coming in a future release. | Text file formats (CSV, TSV, JSON). Support for binary formats is coming in a future release. | Any Hadoop InputFormat. |
+| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig).| Only best-effort. Support for perfect rollup is coming in a future release. | Always perfect. |

Review comment:
  This table looks stale. Would you please update it to match the version in master?
