Re: [PR] [Docs] Refactor streaming ingestion section (druid)

via GitHub Mon, 05 Feb 2024 19:42:32 -0800


ektravel commented on code in PR #15591:
URL: https://github.com/apache/druid/pull/15591#discussion_r1479186254



##########
docs/ingestion/kafka-ingestion.md:
##########
@@ -0,0 +1,448 @@
+---
+id: kafka-ingestion
+title: "Apache Kafka ingestion"
+sidebar_label: "Apache Kafka ingestion"
+description: "Overview of the Kafka indexing service for Druid. Includes 
example supervisor specs to help you get started."
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+:::info
+To use the Kafka indexing service, you must be on Apache Kafka version 0.11.x 
or higher.
+If you are using an older version, refer to the [Apache Kafka upgrade 
guide](https://kafka.apache.org/documentation/#upgrade).
+:::
+
+When you enable the Kafka indexing service, you can configure supervisors on 
the Overlord to manage the creation and lifetime of Kafka indexing tasks.
+Kafka indexing tasks read events using Kafka's partition and offset mechanism to 
guarantee exactly-once ingestion. The supervisor oversees the state of the 
indexing tasks to coordinate handoffs, manage failures, and ensure that 
scalability and replication requirements are maintained.
+
+This topic contains configuration information for the Kafka indexing service 
supervisor for Apache Druid.
+
+## Setup
+
+To use the Kafka indexing service, you must first load the 
`druid-kafka-indexing-service` extension on both the Overlord and the 
MiddleManager. See [Loading extensions](../configuration/extensions.md) for 
more information.
+
+## Supervisor spec configuration
+
+This section outlines the configuration properties that are specific to the 
Apache Kafka streaming ingestion method. For configuration properties shared 
across all streaming ingestion methods supported by Druid, see [Supervisor 
spec](supervisor.md#supervisor-spec).
+
+The following example shows a supervisor spec for the Kafka indexing service:
+
+<details>
+  <summary>Click to view the example</summary>
+
+```json
+{
+  "type": "kafka",
+  "spec": {
+    "dataSchema": {
+      "dataSource": "metrics-kafka",
+      "timestampSpec": {
+        "column": "timestamp",
+        "format": "auto"
+      },
+      "dimensionsSpec": {
+        "dimensions": [],
+        "dimensionExclusions": [
+          "timestamp",
+          "value"
+        ]
+      },
+      "metricsSpec": [
+        {
+          "name": "count",
+          "type": "count"
+        },
+        {
+          "name": "value_sum",
+          "fieldName": "value",
+          "type": "doubleSum"
+        },
+        {
+          "name": "value_min",
+          "fieldName": "value",
+          "type": "doubleMin"
+        },
+        {
+          "name": "value_max",
+          "fieldName": "value",
+          "type": "doubleMax"
+        }
+      ],
+      "granularitySpec": {
+        "type": "uniform",
+        "segmentGranularity": "HOUR",
+        "queryGranularity": "NONE"
+      }
+    },
+    "ioConfig": {
+      "topic": "metrics",
+      "inputFormat": {
+        "type": "json"
+      },
+      "consumerProperties": {
+        "bootstrap.servers": "localhost:9092"
+      },
+      "taskCount": 1,
+      "replicas": 1,
+      "taskDuration": "PT1H"
+    },
+    "tuningConfig": {
+      "type": "kafka",
+      "maxRowsPerSegment": 5000000
+    }
+  }
+}
+```
+
+</details>
+
+### I/O configuration
+
+The following table outlines the `ioConfig` configuration properties specific 
to Kafka.
+For configuration properties shared across all streaming ingestion methods, 
refer to [Supervisor I/O configuration](supervisor.md#io-configuration).
+
+|Property|Type|Description|Required|Default|
+|--------|----|-----------|--------|-------|
+|`topic`|String|The Kafka topic to read from. To ingest data from multiple 
topics, use `topicPattern`.|Yes if `topicPattern` isn't set.||
+|`topicPattern`|String|Multiple Kafka topics to read from, passed as a regex 
pattern. See [Ingest from multiple topics](#ingest-from-multiple-topics) for 
more information.|Yes if `topic` isn't set.||
+|`consumerProperties`|String, Object|A map of properties to pass to the Kafka 
consumer. See [Consumer properties](#consumer-properties) for details.|Yes||
+|`pollTimeout`|Long|The length of time to wait for the Kafka consumer to poll 
records, in milliseconds.|No|100|
+|`useEarliestOffset`|Boolean|If a supervisor manages a datasource for the 
first time, it obtains a set of starting offsets from Kafka. This flag 
determines whether it retrieves the earliest or latest offsets in Kafka. Under 
normal circumstances, subsequent tasks start from where the previous segments 
ended. Druid only uses `useEarliestOffset` on the first run.|No|`false`|
+|`idleConfig`|Object|Defines how and when the Kafka supervisor can become 
idle. See [Idle configuration](#idle-configuration) for more details.|No|null|
+
+#### Ingest from multiple topics
+
+:::info
+If you enable multi-topic ingestion for a datasource, downgrading to a version 
older than
+28.0.0 will cause the ingestion for that datasource to fail.
+:::
+
+You can ingest data from one or multiple topics.
+When ingesting data from multiple topics, Druid assigns partitions based on 
the hashcode of the topic name and the ID of the partition within that topic. 
The partition assignment might not be uniform across all the tasks. Druid 
assumes that partitions across individual topics have similar load. If you want 
to ingest from both high and low load topics in the same supervisor, give the 
high load topic more partitions and the low load topic fewer partitions.
+
+To ingest data from multiple topics, use the `topicPattern` property instead 
of `topic`.
+You pass multiple topics as a regex pattern. For example, to ingest data from 
the `clicks` and `impressions` topics, set `topicPattern` to `clicks|impressions`.
+Similarly, you can use `metrics-.*` as the value for `topicPattern` if you 
want to ingest from all the topics that start with `metrics-`. If you add a new 
topic that matches the regex to the cluster, Druid automatically starts 
ingesting from the new topic. Topic names that match partially, such as 
`my-metrics-12`, are not included for ingestion.
+
+#### Consumer properties
+
+Consumer properties must include the `bootstrap.servers` property, set to a list of 
Kafka brokers in the form `<BROKER_1>:<PORT_1>,<BROKER_2>:<PORT_2>,...`.

Review Comment:
   Added an intro sentence. 
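
   For reference, here is a minimal sketch of an `ioConfig` fragment that follows the `<BROKER_1>:<PORT_1>,<BROKER_2>:<PORT_2>,...` form quoted above and reuses the `clicks|impressions` pattern from the multi-topic section. The broker hostnames and ports are placeholders invented for illustration and are not values from this PR:

   ```json
   {
     "ioConfig": {
       "topicPattern": "clicks|impressions",
       "inputFormat": {
         "type": "json"
       },
       "consumerProperties": {
         "bootstrap.servers": "kafka01.example.com:9092,kafka02.example.com:9092"
       }
     }
   }
   ```

   The fragment sets `topicPattern` instead of `topic`, since the table requires one or the other, and leaves out the remaining `ioConfig` properties shown in the full example spec.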



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

