[GitHub] [incubator-druid] jihoonson commented on a change in pull request #8311: Docusaurus build framework + ingestion doc refresh.

GitBox Mon, 19 Aug 2019 17:04:02 -0700

jihoonson commented on a change in pull request #8311: Docusaurus build 
framework + ingestion doc refresh.
URL: https://github.com/apache/incubator-druid/pull/8311#discussion_r315436827


 ##########
 File path: docs/ingestion/index.md
 ##########
 @@ -0,0 +1,768 @@
+---
+id: index
+title: "Ingestion"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Overview
+
+All data in Druid is organized into _segments_, which are data files that generally have up to a few million rows each.
+Loading data in Druid is called _ingestion_ or _indexing_ and consists of reading data from a source system and creating
+segments based on that data.
+
+In most ingestion methods, the work of loading data is done by Druid MiddleManager processes. One exception is
+Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager
+processes are still involved in starting and monitoring the Hadoop jobs). Once segments have been generated and stored
+in [deep storage](../dependencies/deep-storage.md), they will be loaded by Historical processes. For more details on
+how this works under the hood, see the [Storage design](../design/architecture.md#storage-design) section of Druid's design
+documentation.
+
+## How to use this documentation
+
+This **page you are currently reading** provides information about universal Druid ingestion concepts, and about
+configurations that are common to all [ingestion methods](#ingestion-methods).
+
+The **individual pages for each ingestion method** provide additional information about concepts and configurations
+that are unique to each ingestion method.
+
+We recommend reading (or at least skimming) this universal page first, and then referring to the page for the
+ingestion method or methods that you have chosen.
+
+## Ingestion methods
+
+The table below lists Druid's most common data ingestion methods, along with comparisons to help you choose
+the best one for your situation. Each ingestion method supports its own set of source systems to pull from. For details
+about how each method works, as well as configuration properties specific to that method, check out its documentation
+page.
+
+### Streaming
+
+The most recommended, and most popular, method of streaming ingestion is the
+[Kafka indexing service](../development/extensions-core/kafka-ingestion.md) that reads directly from Kafka. The Kinesis
+indexing service also works well if you prefer Kinesis.
+
+This table compares the major available options:
+
+| **Method** | [Kafka](../development/extensions-core/kafka-ingestion.md) | [Kinesis](../development/extensions-core/kinesis-ingestion.md) | [Tranquility](tranquility.md) |
+|---|-----|--------------|------------|
+| **Supervisor type** | `kafka` | `kinesis` | N/A |
+| **How it works** | Druid reads directly from Apache Kafka. | Druid reads directly from Amazon Kinesis. | Tranquility, a library that ships separately from Druid, is used to push data into Druid. |
+| **Can ingest late data?** | Yes | Yes | No (late data is dropped based on the `windowPeriod` config) |
+| **Exactly-once guarantees?** | Yes | Yes | No |
+
+### Batch
+
+When doing batch loads from files, you should use one-time [tasks](tasks.md), and you have three options: `index`
+(native batch; single-task), `index_parallel` (native batch; parallel), or `index_hadoop` (Hadoop-based).
+
+In general, we recommend native batch whenever it meets your needs, since the setup is simpler (it does not depend on
+an external Hadoop cluster). However, there are still scenarios where Hadoop-based batch ingestion is the right choice,
+especially due to its support for custom partitioning options and reading binary data formats.
+
+This table compares the three available options:
+
+| **Method** | [Native batch (simple)](native-batch.html#simple-task) | [Native batch (parallel)](native-batch.html#parallel-task) | [Hadoop-based](hadoop.html) |
+|---|-----|--------------|------------|
+| **Task type** | `index` | `index_parallel` | `index_hadoop` |
+| **Parallel?** | No. Each task is single-threaded. | Yes, if firehose is splittable and `maxNumConcurrentSubTasks` > 1 in tuningConfig. See [firehose documentation](native-batch.md#firehoses) for details. | Yes, always. |
+| **Can append or overwrite?** | Yes, both. | Yes, both. | Overwrite only. |
+| **External dependencies** | None. | None. | Hadoop cluster (Druid submits Map/Reduce jobs). |
+| **Input locations** | Any [firehose](native-batch.md#firehoses). | Any [firehose](native-batch.md#firehoses). | Any Hadoop FileSystem or Druid datasource. |
+| **File formats** | Text file formats (CSV, TSV, JSON). Support for binary formats is coming in a future release. | Text file formats (CSV, TSV, JSON). Support for binary formats is coming in a future release. | Any Hadoop InputFormat. |
+| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). | Only best-effort. Support for perfect rollup is coming in a future release. | Always perfect. |
 
 Review comment:
   This table looks stale. Would you please update it to match the version in 
master?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
