This is an automated email from the ASF dual-hosted git repository.
techdocsmith pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/master by this push:
new b0db2a87d8 Update Kafka ingestion tutorial (#13261)
b0db2a87d8 is described below
commit b0db2a87d82a49d64f61b72c0066a13904928e69
Author: Jill Osborne <[email protected]>
AuthorDate: Fri Nov 11 22:47:54 2022 +0000
Update Kafka ingestion tutorial (#13261)
* Update Kafka ingestion tutorial
* Update tutorial-kafka.md
Updated location of sample data file
* Added sample data file
* Update tutorial-kafka.md
* Add sample data file
* Update tutorial-kafka.md
Updated sample file location in curl commands
* Update and reuploading sample data files
* Updated spelling file
* Delete .spelling
* Added spelling file
* Update docs/tutorials/tutorial-kafka.md
Co-authored-by: Victoria Lim <[email protected]>
* Update docs/tutorials/tutorial-kafka.md
Co-authored-by: Victoria Lim <[email protected]>
* Updated after review
* Update tutorial-kafka.md
* Updated
* Update tutorial-kafka.md
* Update tutorial-kafka.md
* Update tutorial-kafka.md
* Updated sample data file and command
* Add files via upload
* Delete kttm-nested-data.json.tgz
* Delete kttm-nested-data.json.tgz
* Add files via upload
* Update tutorial-kafka.md
Co-authored-by: Victoria Lim <[email protected]>
---
docs/assets/files/kttm-kafka-supervisor.json | 66 ++++
docs/assets/files/kttm-nested-data.json.tgz | Bin 0 -> 24955539 bytes
docs/assets/tutorial-kafka-data-loader-01.png | Bin 118171 -> 75787 bytes
docs/assets/tutorial-kafka-data-loader-02.png | Bin 613518 -> 420744 bytes
docs/assets/tutorial-kafka-data-loader-03.png | Bin 201934 -> 192051 bytes
docs/assets/tutorial-kafka-data-loader-04.png | Bin 252479 -> 244021 bytes
docs/assets/tutorial-kafka-data-loader-05.png | Bin 256966 -> 206028 bytes
docs/assets/tutorial-kafka-data-loader-05b.png | Bin 0 -> 202167 bytes
docs/assets/tutorial-kafka-data-loader-06.png | Bin 94346 -> 95274 bytes
docs/assets/tutorial-kafka-data-loader-07.png | Bin 135815 -> 114038 bytes
docs/assets/tutorial-kafka-data-loader-08.png | Bin 97816 -> 81750 bytes
docs/assets/tutorial-kafka-data-loader-09.png | Bin 171974 -> 141501 bytes
docs/assets/tutorial-kafka-data-loader-10.png | Bin 113867 -> 80127 bytes
docs/assets/tutorial-kafka-data-loader-11.png | Bin 120255 -> 74913 bytes
docs/assets/tutorial-kafka-data-loader-12.png | Bin 153000 -> 134001 bytes
.../assets/tutorial-kafka-submit-supervisor-01.png | Bin 143733 -> 130690 bytes
docs/tutorials/tutorial-kafka.md | 402 +++++++++++----------
17 files changed, 274 insertions(+), 194 deletions(-)
diff --git a/docs/assets/files/kttm-kafka-supervisor.json b/docs/assets/files/kttm-kafka-supervisor.json
new file mode 100644
index 0000000000..2096f9c7cd
--- /dev/null
+++ b/docs/assets/files/kttm-kafka-supervisor.json
@@ -0,0 +1,66 @@
+{
+  "type": "kafka",
+  "spec": {
+    "ioConfig": {
+      "type": "kafka",
+      "consumerProperties": {
+        "bootstrap.servers": "localhost:9092"
+      },
+      "topic": "kttm",
+      "inputFormat": {
+        "type": "json"
+      },
+      "useEarliestOffset": true
+    },
+    "tuningConfig": {
+      "type": "kafka"
+    },
+    "dataSchema": {
+      "dataSource": "kttm-kafka-supervisor-api",
+      "timestampSpec": {
+        "column": "timestamp",
+        "format": "iso"
+      },
+      "dimensionsSpec": {
+        "dimensions": [
+          "session",
+          "number",
+          "client_ip",
+          "language",
+          "adblock_list",
+          "app_version",
+          "path",
+          "loaded_image",
+          "referrer",
+          "referrer_host",
+          "server_ip",
+          "screen",
+          "window",
+          {
+            "type": "long",
+            "name": "session_length"
+          },
+          "timezone",
+          "timezone_offset",
+          {
+            "type": "json",
+            "name": "event"
+          },
+          {
+            "type": "json",
+            "name": "agent"
+          },
+          {
+            "type": "json",
+            "name": "geo_ip"
+          }
+        ]
+      },
+      "granularitySpec": {
+        "queryGranularity": "none",
+        "rollup": false,
+        "segmentGranularity": "day"
+      }
+    }
+  }
+}
\ No newline at end of file
diff --git a/docs/assets/files/kttm-nested-data.json.tgz b/docs/assets/files/kttm-nested-data.json.tgz
new file mode 100644
index 0000000000..d2f5034330
Binary files /dev/null and b/docs/assets/files/kttm-nested-data.json.tgz differ
diff --git a/docs/assets/tutorial-kafka-data-loader-01.png b/docs/assets/tutorial-kafka-data-loader-01.png
index 12e2820740..7f8d0daacd 100644
Binary files a/docs/assets/tutorial-kafka-data-loader-01.png and b/docs/assets/tutorial-kafka-data-loader-01.png differ
diff --git a/docs/assets/tutorial-kafka-data-loader-02.png b/docs/assets/tutorial-kafka-data-loader-02.png
index fb67d06d9b..8475eeba2b 100644
Binary files a/docs/assets/tutorial-kafka-data-loader-02.png and b/docs/assets/tutorial-kafka-data-loader-02.png differ
diff --git a/docs/assets/tutorial-kafka-data-loader-03.png b/docs/assets/tutorial-kafka-data-loader-03.png
index 022cd75142..dc7400404f 100644
Binary files a/docs/assets/tutorial-kafka-data-loader-03.png and b/docs/assets/tutorial-kafka-data-loader-03.png differ
diff --git a/docs/assets/tutorial-kafka-data-loader-04.png b/docs/assets/tutorial-kafka-data-loader-04.png
index c6c6da2509..5703066959 100644
Binary files a/docs/assets/tutorial-kafka-data-loader-04.png and b/docs/assets/tutorial-kafka-data-loader-04.png differ
diff --git a/docs/assets/tutorial-kafka-data-loader-05.png b/docs/assets/tutorial-kafka-data-loader-05.png
index 4b6ba4dd29..4d4efc6c97 100644
Binary files a/docs/assets/tutorial-kafka-data-loader-05.png and b/docs/assets/tutorial-kafka-data-loader-05.png differ
diff --git a/docs/assets/tutorial-kafka-data-loader-05b.png b/docs/assets/tutorial-kafka-data-loader-05b.png
new file mode 100644
index 0000000000..cc24fbc08e
Binary files /dev/null and b/docs/assets/tutorial-kafka-data-loader-05b.png differ
diff --git a/docs/assets/tutorial-kafka-data-loader-06.png b/docs/assets/tutorial-kafka-data-loader-06.png
index bcd9567f27..4fb96dd47c 100644
Binary files a/docs/assets/tutorial-kafka-data-loader-06.png and b/docs/assets/tutorial-kafka-data-loader-06.png differ
diff --git a/docs/assets/tutorial-kafka-data-loader-07.png b/docs/assets/tutorial-kafka-data-loader-07.png
index b2b9c25af1..b3013b735d 100644
Binary files a/docs/assets/tutorial-kafka-data-loader-07.png and b/docs/assets/tutorial-kafka-data-loader-07.png differ
diff --git a/docs/assets/tutorial-kafka-data-loader-08.png b/docs/assets/tutorial-kafka-data-loader-08.png
index 1b202b4d48..b1cdd2df16 100644
Binary files a/docs/assets/tutorial-kafka-data-loader-08.png and b/docs/assets/tutorial-kafka-data-loader-08.png differ
diff --git a/docs/assets/tutorial-kafka-data-loader-09.png b/docs/assets/tutorial-kafka-data-loader-09.png
index 1775ecd164..e2045ac895 100644
Binary files a/docs/assets/tutorial-kafka-data-loader-09.png and b/docs/assets/tutorial-kafka-data-loader-09.png differ
diff --git a/docs/assets/tutorial-kafka-data-loader-10.png b/docs/assets/tutorial-kafka-data-loader-10.png
index cb9c44c623..39eaa3750a 100644
Binary files a/docs/assets/tutorial-kafka-data-loader-10.png and b/docs/assets/tutorial-kafka-data-loader-10.png differ
diff --git a/docs/assets/tutorial-kafka-data-loader-11.png b/docs/assets/tutorial-kafka-data-loader-11.png
index 9b6aa24cb4..7bd3d9a25e 100644
Binary files a/docs/assets/tutorial-kafka-data-loader-11.png and b/docs/assets/tutorial-kafka-data-loader-11.png differ
diff --git a/docs/assets/tutorial-kafka-data-loader-12.png b/docs/assets/tutorial-kafka-data-loader-12.png
index c62276daa2..ed952b135b 100644
Binary files a/docs/assets/tutorial-kafka-data-loader-12.png and b/docs/assets/tutorial-kafka-data-loader-12.png differ
diff --git a/docs/assets/tutorial-kafka-submit-supervisor-01.png b/docs/assets/tutorial-kafka-submit-supervisor-01.png
index debf3e2de3..809c0c6733 100644
Binary files a/docs/assets/tutorial-kafka-submit-supervisor-01.png and b/docs/assets/tutorial-kafka-submit-supervisor-01.png differ
diff --git a/docs/tutorials/tutorial-kafka.md b/docs/tutorials/tutorial-kafka.md
index 47e26d637b..eb06f4239f 100644
--- a/docs/tutorials/tutorial-kafka.md
+++ b/docs/tutorials/tutorial-kafka.md
@@ -24,260 +24,274 @@ sidebar_label: "Load from Apache Kafka"
-->
-## Getting started
+This tutorial shows you how to load data into Apache Druid from a Kafka stream, using Druid's Kafka indexing service.
-This tutorial demonstrates how to load data into Apache Druid from a Kafka stream, using Druid's Kafka indexing service.
+The tutorial guides you through the steps to load sample nested clickstream data from the [Koalas to the Max](https://www.koalastothemax.com/) game into a Kafka topic, then ingest the data into Druid.
-For this tutorial, we'll assume you've already downloaded Druid as described in
-the [quickstart](index.md) using the `micro-quickstart` single-machine configuration and have it
-running on your local machine. You don't need to have loaded any data yet.
+## Prerequisites
+
+Before you follow the steps in this tutorial, download Druid as described in the [quickstart](index.md) using the [micro-quickstart](../operations/single-server.md#micro-quickstart-4-cpu-16gib-ram) single-machine configuration and have it running on your local machine. You don't need to have loaded any data.
## Download and start Kafka
-[Apache Kafka](http://kafka.apache.org/) is a high throughput message bus that works well with
-Druid. For this tutorial, we will use Kafka 2.7.0. To download Kafka, issue the following
-commands in your terminal:
+[Apache Kafka](http://kafka.apache.org/) is a high-throughput message bus that works well with Druid. For this tutorial, use Kafka 2.7.0.
+
+1. To download Kafka, run the following commands in your terminal:
-```bash
-curl -O https://archive.apache.org/dist/kafka/2.7.0/kafka_2.13-2.7.0.tgz
-tar -xzf kafka_2.13-2.7.0.tgz
-cd kafka_2.13-2.7.0
-```
-Start zookeeper first with the following command:
+ ```bash
+ curl -O https://archive.apache.org/dist/kafka/2.7.0/kafka_2.13-2.7.0.tgz
+ tar -xzf kafka_2.13-2.7.0.tgz
+ cd kafka_2.13-2.7.0
+ ```
+2. If you're already running Kafka on the machine you're using for this tutorial, delete or rename the `kafka-logs` directory in `/tmp`.
+
+   > Druid and Kafka both rely on [Apache ZooKeeper](https://zookeeper.apache.org/) to coordinate and manage services. Because Druid is already running, Kafka attaches to the Druid ZooKeeper instance when it starts up.<br>
+   In a production environment where you're running Druid and Kafka on different machines, [start the Kafka ZooKeeper](https://kafka.apache.org/quickstart) before you start the Kafka broker.
-```bash
-./bin/zookeeper-server-start.sh config/zookeeper.properties
-```
+3. In the Kafka root directory, run this command to start a Kafka broker:
-Start a Kafka broker by running the following command in a new terminal:
+ ```bash
+ ./bin/kafka-server-start.sh config/server.properties
+ ```
-```bash
-./bin/kafka-server-start.sh config/server.properties
-```
+4. In a new terminal window, navigate to the Kafka root directory and run the following command to create a Kafka topic called `kttm`:
-Run this command to create a Kafka topic called *wikipedia*, to which we'll send data:
+   ```bash
+   ./bin/kafka-topics.sh --create --topic kttm --bootstrap-server localhost:9092
+   ```
-```bash
-./bin/kafka-topics.sh --create --topic wikipedia --bootstrap-server localhost:9092
-```
+   Kafka returns a message when it successfully adds the topic: `Created topic kttm`.
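+
+   If you want to confirm that the topic exists before sending data, you can describe it. This optional check is a sketch using standard Kafka CLI flags:
+
+   ```bash
+   # Describe the topic to confirm it was created
+   ./bin/kafka-topics.sh --describe --topic kttm --bootstrap-server localhost:9092
+   ```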
## Load data into Kafka
-Let's launch a producer for our topic and send some data!
+In this section, you download sample data to the tutorial's directory and send the data to your Kafka topic.
-In your Druid directory, run the following command:
+1. In your Kafka root directory, create a directory for the sample data:
-```bash
-cd quickstart/tutorial
-gunzip -c wikiticker-2015-09-12-sampled.json.gz > wikiticker-2015-09-12-sampled.json
-```
+ ```bash
+ mkdir sample-data
+ ```
-In your Kafka directory, run the following command, where {PATH_TO_DRUID} is replaced by the path to the Druid directory:
+2. Download the sample data to your new directory and extract it:
-```bash
-export KAFKA_OPTS="-Dfile.encoding=UTF-8"
-./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia < {PATH_TO_DRUID}/quickstart/tutorial/wikiticker-2015-09-12-sampled.json
-```
+   ```bash
+   cd sample-data
+   curl -O https://druid.apache.org/docs/latest/assets/files/kttm-nested-data.json.tgz
+   tar -xzf kttm-nested-data.json.tgz
+   ```
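+
+   To check the download, you can preview the first event. This optional step assumes the extracted file is newline-delimited JSON:
+
+   ```bash
+   # Print the first JSON event from the extracted sample file
+   head -n 1 kttm-nested-data.json
+   ```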
-The previous command posted sample events to the *wikipedia* Kafka topic.
-Now we will use Druid's Kafka indexing service to ingest messages from our newly created topic.
+3. In your Kafka root directory, run the following commands to post sample events to the `kttm` Kafka topic:
-## Loading data with the data loader
+   ```bash
+   export KAFKA_OPTS="-Dfile.encoding=UTF-8"
+   ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic kttm < ./sample-data/kttm-nested-data.json
+   ```
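+
+   To confirm that the events arrived, you can read a few of them back. This optional check is a sketch using the standard Kafka console consumer:
+
+   ```bash
+   # Read the first three events back from the topic, then exit
+   ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic kttm --from-beginning --max-messages 3
+   ```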
-Navigate to [localhost:8888](http://localhost:8888) and click `Load data` in the console header.
+## Load data into Druid
-
+Now that you have data in your Kafka topic, you can use Druid's Kafka indexing service to ingest the data into Druid.
-Select `Apache Kafka` and click `Connect data`.
+To do this, you can use the Druid console data loader or you can submit a supervisor spec. Follow the steps below to try each method.
-
+### Load data with the console data loader
-Enter `localhost:9092` as the bootstrap server and `wikipedia` as the topic.
+The Druid console data loader presents you with several screens to configure each section of the supervisor spec, then creates an ingestion task to ingest the Kafka data.
-Click `Apply` and make sure that the data you are seeing is correct.
+To use the console data loader:
-Once the data is located, you can click "Next: Parse data" to go to the next step.
+1. Navigate to [localhost:8888](http://localhost:8888) and click **Load data > Streaming**.
-
+ 
-The data loader will try to automatically determine the correct parser for the data.
-In this case it will successfully determine `json`.
-Feel free to play around with different parser options to get a preview of how Druid will parse your data.
+2. Click **Apache Kafka** and then **Connect data**.
-With the `json` parser selected, click `Next: Parse time` to get to the step centered around determining your primary timestamp column.
+3. Enter `localhost:9092` as the bootstrap server and `kttm` as the topic, then click **Apply** and make sure you see data similar to the following:
-
+ 
-Druid's architecture requires a primary timestamp column (internally stored in a column called `__time`).
-If you do not have a timestamp in your data, select `Constant value`.
-In our example, the data loader will determine that the `time` column in our raw data is the only candidate that can be used as the primary time column.
+4. Click **Next: Parse data**.
-Click `Next: ...` twice to go past the `Transform` and `Filter` steps.
-You do not need to enter anything in these steps as applying ingestion time transforms and filters are out of scope for this tutorial.
+ 
-
+   The data loader automatically tries to determine the correct parser for the data. For the sample data, it selects input format `json`. You can play around with the different options to get a preview of how Druid parses your data.
-In the `Configure schema` step, you can configure which [dimensions](../ingestion/data-model.md#dimensions) and [metrics](../ingestion/data-model.md#metrics) will be ingested into Druid.
-This is exactly what the data will appear like in Druid once it is ingested.
-Since our dataset is very small, go ahead and turn off [`Rollup`](../ingestion/rollup.md) by clicking on the switch and confirming the change.
+5. With the `json` input format selected, click **Next: Parse time**. You may need to click **Apply** first.
-Once you are satisfied with the schema, click `Next` to go to the `Partition` step where you can fine tune how the data will be partitioned into segments.
+ 
-
+   Druid's architecture requires that you specify a primary timestamp column. Druid stores the timestamp in the `__time` column in your Druid datasource.
+   In a production environment, if you don't have a timestamp in your data, you can select **Parse timestamp from:** `None` to use a placeholder value.
-Here, you can adjust how the data will be split up into segments in Druid.
-Since this is a small dataset, there are no adjustments that need to be made in this step.
+   For the sample data, the data loader selects the `timestamp` column in the raw data as the primary time column.
-Click `Next: Tune` to go to the tuning step.
+6. Click **Next: ...** three times to go past the **Transform** and **Filter** steps to **Configure schema**. You don't need to enter anything in these two steps because applying transforms and filters is out of scope for this tutorial.
-
+ 
-In the `Tune` step is it *very important* to set `Use earliest offset` to `True` since we want to consume the data from the start of the stream.
-There are no other changes that need to be made here, so click `Next: Publish` to go to the `Publish` step.
+7. In the **Configure schema** step, you can select data types for the columns and configure [dimensions](../ingestion/data-model.md#dimensions) and [metrics](../ingestion/data-model.md#metrics) to ingest into Druid. The console does most of this for you, but you need to create JSON-type dimensions for the three nested columns in the data.
-
+   Click **Add dimension** and enter the following information. You can only add one dimension at a time.
+ - Name: `event`, Type: `json`
+ - Name: `agent`, Type: `json`
+ - Name: `geo_ip`, Type: `json`
+
+   After you create the dimensions, you can scroll to the right in the preview window to see the nested columns:
-Let's name this datasource `wikipedia-kafka`.
+ 
-Finally, click `Next` to review your spec.
+8. Click **Next: Partition** to configure how Druid partitions the data into segments.
-
+ 
-This is the spec you have constructed.
-Feel free to go back and make changes in previous steps to see how changes will update the spec.
-Similarly, you can also edit the spec directly and see it reflected in the previous steps.
+9. Select `day` as the **Segment granularity**. Since this is a small dataset, you don't need to make any further adjustments. Click **Next: Tune** to fine-tune how Druid ingests data.
+
+ 
-Once you are satisfied with the spec, click `Submit` and an ingestion task will be created.
+10. In **Input tuning**, set **Use earliest offset** to `True`. This is very important because you want to consume the data from the start of the stream. There are no other changes to make here, so click **Next: Publish**.
-
+ 
-You will be taken to the task view with the focus on the newly created supervisor.
+11. Name the datasource `kttm-kafka` and click **Next: Edit spec** to review your spec.
-The task view is set to auto refresh, wait until your supervisor launches a task.
+ 
-When a tasks starts running, it will also start serving the data that it is ingesting.
+    The console presents the spec you've constructed. You can click the buttons above the spec to make changes in previous steps and see how the changes update the spec. You can also edit the spec directly and see it reflected in the previous steps.
+
+12. Click **Submit** to create an ingestion task.
-Navigate to the `Datasources` view from the header.
+    Druid displays the task view with the focus on the newly created supervisor.
-
+    The task view auto-refreshes, so wait until the supervisor launches a task. The status changes from **Pending** to **Running** as Druid starts to ingest data.
-When the `wikipedia-kafka` datasource appears here it can be queried.
+ 
-*Note:* if the datasource does not appear after a minute you might have not set the supervisor to read from the start of the stream (in the `Tune` step).
+13. Navigate to the **Datasources** view from the header.
-At this point, you can go to the `Query` view to run SQL queries against the datasource.
+ 
-Since this is a small dataset, you can simply run a `SELECT * FROM "wikipedia-kafka"` query to see your results.
+    When the `kttm-kafka` datasource appears here, you can query it. See [Query your data](#query-your-data) for details.
-
+    > If the datasource doesn't appear after a minute, you might not have set the supervisor to read data from the start of the stream (the `Use earliest offset` setting in the **Tune** step). Go to the **Ingestion** page and terminate the supervisor using the **Actions(...)** menu. [Load the sample data](#load-data-with-the-console-data-loader) again and apply the correct setting when you get to the **Tune** step.
+
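+    If you prefer the terminal, you can also terminate a supervisor through the Druid API. The following is a sketch, assuming the supervisor ID matches the `kttm-kafka` datasource name:
+
+    ```bash
+    # Terminate the supervisor created by the data loader (ID assumed to match the datasource name)
+    curl -XPOST http://localhost:8081/druid/indexer/v1/supervisor/kttm-kafka/terminate
+    ```
+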
+### Submit a supervisor spec
+
+As an alternative to using the data loader, you can submit a supervisor spec to Druid. You can do this in the console or using the Druid API.
+
+#### Use the console
-Check out the [query tutorial](../tutorials/tutorial-query.md) to run some example queries on the newly loaded data.
-
-
-### Submit a supervisor via the console
-
-In the console, click `Submit supervisor` to open the submit supervisor dialog.
-
-
-
-Paste in this spec and click `Submit`.
-
-```json
-{
-  "type": "kafka",
-  "spec" : {
-    "dataSchema": {
-      "dataSource": "wikipedia",
-      "timestampSpec": {
-        "column": "time",
-        "format": "auto"
-      },
-      "dimensionsSpec": {
-        "dimensions": [
-          "channel",
-          "cityName",
-          "comment",
-          "countryIsoCode",
-          "countryName",
-          "isAnonymous",
-          "isMinor",
-          "isNew",
-          "isRobot",
-          "isUnpatrolled",
-          "metroCode",
-          "namespace",
-          "page",
-          "regionIsoCode",
-          "regionName",
-          "user",
-          { "name": "added", "type": "long" },
-          { "name": "deleted", "type": "long" },
-          { "name": "delta", "type": "long" }
-        ]
-      },
-      "metricsSpec" : [],
-      "granularitySpec": {
-        "type": "uniform",
-        "segmentGranularity": "DAY",
-        "queryGranularity": "NONE",
-        "rollup": false
-      }
-    },
-    "tuningConfig": {
-      "type": "kafka",
-      "reportParseExceptions": false
-    },
-    "ioConfig": {
-      "topic": "wikipedia",
-      "inputFormat": {
-        "type": "json"
-      },
-      "replicas": 2,
-      "taskDuration": "PT10M",
-      "completionTimeout": "PT20M",
-      "consumerProperties": {
-        "bootstrap.servers": "localhost:9092"
-      }
-    }
-  }
-}
-```
-
-This will start the supervisor that will in turn spawn some tasks that will start listening for incoming data.
-
-### Submit a supervisor directly
-
-To start the service directly, we will need to submit a supervisor spec to the Druid overlord by running the following from the Druid package root:
-
-```bash
-curl -XPOST -H'Content-Type: application/json' -d @quickstart/tutorial/wikipedia-kafka-supervisor.json http://localhost:8081/druid/indexer/v1/supervisor
-```
-
-
-If the supervisor was successfully created, you will get a response containing the ID of the supervisor; in our case we should see `{"id":"wikipedia"}`.
-
-For more details about what's going on here, check out the
-[Druid Kafka indexing service documentation](../development/extensions-core/kafka-ingestion.md).
-
-You can view the current supervisors and tasks in the web console: [http://localhost:8888/unified-console.md#tasks](http://localhost:8888/unified-console.html#tasks).
-
-## Querying your data
-
-After data is sent to the Kafka stream, it is immediately available for querying.
-
-Please follow the [query tutorial](../tutorials/tutorial-query.md) to run some example queries on the newly loaded data.
-
-## Cleanup
-
-To go through any of the other ingestion tutorials, you will need to shut down the cluster and reset the cluster state by removing the contents of the `var` directory in the Druid home, as the other tutorials will write to the same "wikipedia" datasource.
-
-You should additionally clear out any Kafka state. Do so by shutting down the Kafka broker with CTRL-C before stopping ZooKeeper and the Druid services, and then deleting the Kafka log directory at `/tmp/kafka-logs`:
-
-```bash
-rm -rf /tmp/kafka-logs
-```
+To submit a supervisor spec using the Druid console:
+1. Click **Ingestion** in the console, then click the ellipsis next to the refresh button and select **Submit JSON supervisor**.
+
+2. Paste this spec into the JSON window and click **Submit**.
+   ```json
+   {
+     "type": "kafka",
+     "spec": {
+       "ioConfig": {
+         "type": "kafka",
+         "consumerProperties": {
+           "bootstrap.servers": "localhost:9092"
+         },
+         "topic": "kttm",
+         "inputFormat": {
+           "type": "json"
+         },
+         "useEarliestOffset": true
+       },
+       "tuningConfig": {
+         "type": "kafka"
+       },
+       "dataSchema": {
+         "dataSource": "kttm-kafka-supervisor-console",
+         "timestampSpec": {
+           "column": "timestamp",
+           "format": "iso"
+         },
+         "dimensionsSpec": {
+           "dimensions": [
+             "session",
+             "number",
+             "client_ip",
+             "language",
+             "adblock_list",
+             "app_version",
+             "path",
+             "loaded_image",
+             "referrer",
+             "referrer_host",
+             "server_ip",
+             "screen",
+             "window",
+             {
+               "type": "long",
+               "name": "session_length"
+             },
+             "timezone",
+             "timezone_offset",
+             {
+               "type": "json",
+               "name": "event"
+             },
+             {
+               "type": "json",
+               "name": "agent"
+             },
+             {
+               "type": "json",
+               "name": "geo_ip"
+             }
+           ]
+         },
+         "granularitySpec": {
+           "queryGranularity": "none",
+           "rollup": false,
+           "segmentGranularity": "day"
+         }
+       }
+     }
+   }
+   ```
+
+
+   This starts the supervisor. The supervisor spawns tasks that start listening for incoming data.
+
+3. Click **Tasks** on the console home page to monitor the status of the job. This spec writes the data in the `kttm` topic to a datasource named `kttm-kafka-supervisor-console`.
+
+#### Use the API
+
+You can also use the Druid API to submit a supervisor spec.
+
+1. Run the following command to download the sample spec:
+
+ ```bash
+   curl -O https://druid.apache.org/docs/latest/assets/files/kttm-kafka-supervisor.json
+ ```
+
+2. Run the following command to submit the spec in the `kttm-kafka-supervisor.json` file:
+
+ ```bash
+   curl -XPOST -H 'Content-Type: application/json' -d @kttm-kafka-supervisor.json http://localhost:8081/druid/indexer/v1/supervisor
+ ```
+
+   After Druid successfully creates the supervisor, you get a response containing the supervisor ID: `{"id":"kttm-kafka-supervisor-api"}`.
+
+3. Click **Tasks** on the console home page to monitor the status of the job. This spec writes the data in the `kttm` topic to a datasource named `kttm-kafka-supervisor-api`.
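+
+   You can also check on the supervisor from the terminal. The following is a sketch that queries the supervisor status endpoint, using the `kttm-kafka-supervisor-api` ID returned above:
+
+   ```bash
+   # Fetch the current state of the supervisor (ID from the submission response)
+   curl http://localhost:8081/druid/indexer/v1/supervisor/kttm-kafka-supervisor-api/status
+   ```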
+
+## Query your data
+
+After Druid ingests data from the Kafka stream, it is immediately available for querying. Click **Query** in the Druid console to run SQL queries against the datasource.
+
+Since this tutorial ingests a small dataset, you can run the query `SELECT * FROM "kttm-kafka"` to return all of the data in the dataset you created.
+
+
+
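+You can also send SQL over HTTP, which is useful for scripting. The following is a sketch that posts a query to Druid's SQL API through the router; the `$.category` path is an assumption about the fields inside the nested `agent` column:
+
+```bash
+# Query the nested agent column over HTTP; $.category is an assumed field path
+curl -XPOST -H 'Content-Type: application/json' \
+  http://localhost:8888/druid/v2/sql \
+  -d '{"query": "SELECT session, JSON_VALUE(agent, '\''$.category'\'') AS agent_category FROM \"kttm-kafka\" LIMIT 5"}'
+```
+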
+Check out the [Querying data tutorial](../tutorials/tutorial-query.md) to run some example queries on the newly loaded data.
## Further reading
-For more information on loading data from Kafka streams, please see the [Druid Kafka indexing service documentation](../development/extensions-core/kafka-ingestion.md).
+For more information, see the following topics:
+
+- [Apache Kafka ingestion](../development/extensions-core/kafka-ingestion.md) for more information on loading data from Kafka streams.
+- [Apache Kafka supervisor reference](../development/extensions-core/kafka-supervisor-reference.md) for Kafka supervisor configuration information.
+- [Apache Kafka supervisor operations reference](../development/extensions-core/kafka-supervisor-operations.md) for information on running and maintaining Kafka supervisors for Druid.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]