[GitHub] [druid] writer-jill commented on a diff in pull request #13261: Update Kafka ingestion tutorial

GitBox Wed, 09 Nov 2022 01:55:12 -0800


writer-jill commented on code in PR #13261:
URL: https://github.com/apache/druid/pull/13261#discussion_r1017712282



##########
docs/tutorials/tutorial-kafka.md:
##########
@@ -24,260 +24,267 @@ sidebar_label: "Load from Apache Kafka"
   -->
 
 
-## Getting started
+This tutorial shows you how to load data into Apache Druid from a Kafka 
stream, using Druid's Kafka indexing service. 
 
-This tutorial demonstrates how to load data into Apache Druid from a Kafka 
stream, using Druid's Kafka indexing service.
+The tutorial guides you through the steps to load sample nested clickstream 
data from the [Koalas to the Max](https://www.koalastothemax.com/) game into a 
Kafka topic, then ingest the data into Druid.
 
-For this tutorial, we'll assume you've already downloaded Druid as described in
-the [quickstart](index.md) using the `micro-quickstart` single-machine 
configuration and have it
-running on your local machine. You don't need to have loaded any data yet.
+## Prerequisites
+
+Before you follow the steps in this tutorial, download Druid as described in 
the [quickstart](index.md) using the 
[micro-quickstart](../operations/single-server.md#micro-quickstart-4-cpu-16gib-ram)
 single-machine configuration and have it running on your local machine. You 
don't need to have loaded any data.
 
 ## Download and start Kafka
 
-[Apache Kafka](http://kafka.apache.org/) is a high throughput message bus that 
works well with
-Druid.  For this tutorial, we will use Kafka 2.7.0. To download Kafka, issue 
the following
-commands in your terminal:
+[Apache Kafka](http://kafka.apache.org/) is a high-throughput message bus that 
works well with Druid. For this tutorial, use Kafka 2.7.0. 
+
+1. To download Kafka, run the following commands in your terminal:
 
-```bash
-curl -O https://archive.apache.org/dist/kafka/2.7.0/kafka_2.13-2.7.0.tgz
-tar -xzf kafka_2.13-2.7.0.tgz
-cd kafka_2.13-2.7.0
-```
-Start zookeeper first with the following command:
+   ```bash
+   curl -O https://archive.apache.org/dist/kafka/2.7.0/kafka_2.13-2.7.0.tgz
+   tar -xzf kafka_2.13-2.7.0.tgz
+   cd kafka_2.13-2.7.0
+   ```
+2. If you're already running Kafka on the machine you're using for this 
tutorial, delete or rename the `kafka-logs` directory in `/tmp`.
+   
+   > Druid and Kafka both rely on [Apache 
ZooKeeper](https://zookeeper.apache.org/) to coordinate and manage services. 
Because Druid is already running, Kafka attaches to the Druid ZooKeeper 
instance when it starts up.<br>
+   In a production environment where you're running Druid and Kafka on 
different machines, [start the Kafka 
ZooKeeper](https://kafka.apache.org/quickstart) before you start the Kafka 
broker.
 
-```bash
-./bin/zookeeper-server-start.sh config/zookeeper.properties
-```
+3. In the Kafka root directory, run this command to start a Kafka broker:
 
-Start a Kafka broker by running the following command in a new terminal:
+   ```bash
+   ./bin/kafka-server-start.sh config/server.properties
+   ```
 
-```bash
-./bin/kafka-server-start.sh config/server.properties
-```
+4. In a new terminal window, navigate to the Kafka root directory and run the 
following command to create a Kafka topic called `kttm`:
 
-Run this command to create a Kafka topic called *wikipedia*, to which we'll 
send data:
+   ```bash
+   ./bin/kafka-topics.sh --create --topic kttm --bootstrap-server 
localhost:9092
+   ```
 
-```bash
-./bin/kafka-topics.sh --create --topic wikipedia --bootstrap-server 
localhost:9092
-```     
+   Kafka returns a message when it successfully adds the topic: `Created topic 
kttm`.
 
 ## Load data into Kafka
 
-Let's launch a producer for our topic and send some data!
+In this section, you download sample data to the tutorial's directory and send 
the data to your Kafka topic.
 
-In your Druid directory, run the following command:
+1. Run the following commands from your Druid root directory to download and 
extract the sample data:
 
-```bash
-cd quickstart/tutorial
-gunzip -c wikiticker-2015-09-12-sampled.json.gz > 
wikiticker-2015-09-12-sampled.json
-```
+   ```bash
+   curl -O 
https://druid.apache.org/docs/latest/assets/files/kttm-nested-data.json.gz
+   tar -xzf kttm-nested-data.json.gz
+   ```
 
-In your Kafka directory, run the following command, where {PATH_TO_DRUID} is 
replaced by the path to the Druid directory:
+2. In your Kafka root directory, run the following commands to post sample 
events to the `kttm` Kafka topic. Replace `{PATH_TO_DRUID}` with the path to 
your Druid root directory:
 
-```bash
-export KAFKA_OPTS="-Dfile.encoding=UTF-8"
-./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia 
< {PATH_TO_DRUID}/quickstart/tutorial/wikiticker-2015-09-12-sampled.json
-```
+   ```bash
+   export KAFKA_OPTS="-Dfile.encoding=UTF-8" 
+   ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic kttm < 
{PATH_TO_DRUID}/kttm-nested-data.json
+   ```
 
-The previous command posted sample events to the *wikipedia* Kafka topic.
-Now we will use Druid's Kafka indexing service to ingest messages from our 
newly created topic.
+## Load data into Druid
 
-## Loading data with the data loader
+Now that you have data in your Kafka topic, you can use Druid's Kafka indexing 
service to ingest the data into Druid. 
 
-Navigate to [localhost:8888](http://localhost:8888) and click `Load data` in 
the console header.
+To do this, you can use the Druid console data loader or you can submit a 
supervisor spec. Follow the steps below to try each method.
 
-![Data loader init](../assets/tutorial-kafka-data-loader-01.png "Data loader 
init")
+### Load data with the console data loader
 
-Select `Apache Kafka` and click `Connect data`.
+The Druid console data loader presents you with several screens to configure 
each section of the supervisor spec, then creates an ingestion task to ingest 
the Kafka data. 
 
-![Data loader sample](../assets/tutorial-kafka-data-loader-02.png "Data loader 
sample")
+To use the console data loader:
 
-Enter `localhost:9092` as the bootstrap server and `wikipedia` as the topic.
+1. Navigate to [localhost:8888](http://localhost:8888) and click **Load data > 
Streaming**.
 
-Click `Apply` and make sure that the data you are seeing is correct.
+   ![Data loader init](../assets/tutorial-kafka-data-loader-01.png "Data 
loader init")
 
-Once the data is located, you can click "Next: Parse data" to go to the next 
step.
+2. Click **Apache Kafka** and then **Connect data**.
 
-![Data loader parse data](../assets/tutorial-kafka-data-loader-03.png "Data 
loader parse data")
+3. Enter `localhost:9092` as the bootstrap server and `kttm` as the topic, 
then click **Apply** and make sure you see data similar to the following:
 
-The data loader will try to automatically determine the correct parser for the 
data.
-In this case it will successfully determine `json`.
-Feel free to play around with different parser options to get a preview of how 
Druid will parse your data.
+   ![Data loader sample](../assets/tutorial-kafka-data-loader-02.png "Data 
loader sample")
 
-With the `json` parser selected, click `Next: Parse time` to get to the step 
centered around determining your primary timestamp column.
+4. Click **Next: Parse data**.
 
-![Data loader parse time](../assets/tutorial-kafka-data-loader-04.png "Data 
loader parse time")
+   ![Data loader parse data](../assets/tutorial-kafka-data-loader-03.png "Data 
loader parse data")
 
-Druid's architecture requires a primary timestamp column (internally stored in 
a column called `__time`).
-If you do not have a timestamp in your data, select `Constant value`.
-In our example, the data loader will determine that the `time` column in our 
raw data is the only candidate that can be used as the primary time column.
+   The data loader automatically tries to determine the correct parser for the 
data. For the sample data, it selects input format `json`. You can play around 
with the different options to get a preview of how Druid parses your data.
 
-Click `Next: ...` twice to go past the `Transform` and `Filter` steps.
-You do not need to enter anything in these steps as applying ingestion time 
transforms and filters are out of scope for this tutorial.
+5. With the `json` input format selected, click **Next: Parse time**. You may 
need to click **Apply** first.
 
-![Data loader schema](../assets/tutorial-kafka-data-loader-05.png "Data loader 
schema")
+   ![Data loader parse time](../assets/tutorial-kafka-data-loader-04.png "Data 
loader parse time")
 
-In the `Configure schema` step, you can configure which 
[dimensions](../ingestion/data-model.md#dimensions) and 
[metrics](../ingestion/data-model.md#metrics) will be ingested into Druid.
-This is exactly what the data will appear like in Druid once it is ingested.
-Since our dataset is very small, go ahead and turn off 
[`Rollup`](../ingestion/rollup.md) by clicking on the switch and confirming the 
change.
+   Druid's architecture requires that you specify a primary timestamp column. 
Druid stores the timestamp in the `__time` column in your Druid datasource.
+   In a production environment, if you don't have a timestamp in your data, 
you can select **Parse timestamp from:** `none` to use a placeholder value. 
 
-Once you are satisfied with the schema, click `Next` to go to the `Partition` 
step where you can fine tune how the data will be partitioned into segments.
+   For the sample data, the data loader selects the `timestamp` column in the 
raw data as the primary time column.
 
-![Data loader partition](../assets/tutorial-kafka-data-loader-06.png "Data 
loader partition")
+6. Click **Next: ...** three times to go past the **Transform** and **Filter** 
steps to **Configure schema**. You don't need to enter anything in these two 
steps because applying transforms and filters is out of scope for this tutorial.
 
-Here, you can adjust how the data will be split up into segments in Druid.
-Since this is a small dataset, there are no adjustments that need to be made 
in this step.
+   ![Data loader schema](../assets/tutorial-kafka-data-loader-05.png "Data 
loader schema")
 
-Click `Next: Tune` to go to the tuning step.
+7. In the **Configure schema** step, you can select data types for the columns 
and configure [dimensions](../ingestion/data-model.md#dimensions) and 
[metrics](../ingestion/data-model.md#metrics) to ingest into Druid. The sample 
data contains three nested columns, so you need to create JSON-type dimensions 
for them. 
 
-![Data loader tune](../assets/tutorial-kafka-data-loader-07.png "Data loader 
tune")
+    Click **Add dimension** and enter the following information. You can only 
add one dimension at a time.
+    - Name: `event`, Type: `json`
+    - Name: `agent`, Type: `json`
+    - Name: `geo_ip`, Type: `json`
+  
+    After you create the dimensions, you can scroll to the right in the 
preview window to see the nested columns:
 
-In the `Tune` step is it *very important* to set `Use earliest offset` to 
`True` since we want to consume the data from the start of the stream.
-There are no other changes that need to be made here, so click `Next: Publish` 
to go to the `Publish` step.
+    ![Nested columns schema](../assets/tutorial-kafka-data-loader-05b.png 
"Nested columns schema")
 
-![Data loader publish](../assets/tutorial-kafka-data-loader-08.png "Data 
loader publish")
+8.  Click **Next: Partition** to configure how Druid partitions the data into 
segments.
 
-Let's name this datasource `wikipedia-kafka`.
+    ![Data loader partition](../assets/tutorial-kafka-data-loader-06.png "Data 
loader partition")
 
-Finally, click `Next` to review your spec.
+9.  Select `day` as the **Segment granularity**. Since this is a small 
dataset, you don't need to make any further adjustments. Click **Next: Tune** 
to fine-tune how Druid ingests data.
+   
+    ![Data loader tune](../assets/tutorial-kafka-data-loader-07.png "Data 
loader tune")
 
-![Data loader spec](../assets/tutorial-kafka-data-loader-09.png "Data loader 
spec")
+10. In **Input tuning**, set **Use earliest offset** to `True`&mdash;this is 
very important because you want to consume the data from the start of the 
stream. There are no other changes to make here, so click **Next: Publish**.
 
-This is the spec you have constructed.
-Feel free to go back and make changes in previous steps to see how changes 
will update the spec.
-Similarly, you can also edit the spec directly and see it reflected in the 
previous steps.
+    ![Data loader publish](../assets/tutorial-kafka-data-loader-08.png "Data 
loader publish")
 
-Once you are satisfied with the spec, click `Submit` and an ingestion task 
will be created.
+11. Name the datasource `kttm-kafka` and click **Next: Edit spec** to review 
your spec.
 
-![Tasks view](../assets/tutorial-kafka-data-loader-10.png "Tasks view")
+    ![Data loader spec](../assets/tutorial-kafka-data-loader-09.png "Data 
loader spec")
 
-You will be taken to the task view with the focus on the newly created 
supervisor.
+    The console presents the spec you've constructed. You can click the 
buttons above the spec to make changes in previous steps and see how the 
changes update the spec. You can also edit the spec directly and see it 
reflected in the previous steps.
+   
+12. Click **Submit** to create an ingestion task.
 
-The task view is set to auto refresh, wait until your supervisor launches a 
task.
+    Druid displays the task view with the focus on the newly created 
supervisor.
 
-When a tasks starts running, it will also start serving the data that it is 
ingesting.
+    The task view auto-refreshes, so wait until the supervisor launches a 
task. The status changes from **Pending** to **Running** as Druid starts to 
ingest data.
 
-Navigate to the `Datasources` view from the header.
+    ![Tasks view](../assets/tutorial-kafka-data-loader-10.png "Tasks view")
 
-![Datasource view](../assets/tutorial-kafka-data-loader-11.png "Datasource 
view")
+13. Navigate to the **Datasources** view from the header.
 
-When the `wikipedia-kafka` datasource appears here it can be queried. 
+    ![Datasource view](../assets/tutorial-kafka-data-loader-11.png "Datasource 
view")
 
-*Note:* if the datasource does not appear after a minute you might have not 
set the supervisor to read from the start of the stream (in the `Tune` step).
+    When the `kttm-kafka` datasource appears here, you can query it. See 
[Query your data](#query-your-data) for details.
 
-At this point, you can go to the `Query` view to run SQL queries against the 
datasource.
+    > If the datasource doesn't appear after a minute you might not have set 
the supervisor to read data from the start of the stream&mdash;the `Use 
earliest offset` setting in the **Tune** step. Go to the **Ingestion** page and 
terminate the supervisor using the **Actions** menu. [Load the sample 
data](#load-data-with-the-console-data-loader) again and apply the correct 
setting when you get to the **Tune** step.

Review Comment:
   I don't like including the ellipsis and brackets in the instruction. I think 
it's clear what it's referring to in the UI. I'll add it to our style guide 
discussion list.
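
   For reference, the data loader steps in the diff above (bootstrap server `localhost:9092`, topic `kttm`, `json` input format, three JSON-type dimensions, `day` segment granularity, earliest offset, datasource `kttm-kafka`) map roughly onto a Kafka supervisor spec. The following is a minimal sketch, not the spec from the PR: the `auto` timestamp format is an assumption, only the three nested dimensions are listed, and the helper name `build_supervisor_spec` is hypothetical.

   ```python
   import json

   # Values taken from the tutorial steps in the diff above.
   BOOTSTRAP_SERVER = "localhost:9092"
   TOPIC = "kttm"
   DATASOURCE = "kttm-kafka"
   NESTED_COLUMNS = ["event", "agent", "geo_ip"]


   def build_supervisor_spec():
       """Build a minimal Kafka supervisor spec as a plain dict (illustrative only)."""
       return {
           "type": "kafka",
           "spec": {
               "dataSchema": {
                   "dataSource": DATASOURCE,
                   # The tutorial picks the `timestamp` column; "auto" is an assumed format.
                   "timestampSpec": {"column": "timestamp", "format": "auto"},
                   "dimensionsSpec": {
                       # Only the nested columns are shown; a real spec lists more dimensions.
                       "dimensions": [
                           {"type": "json", "name": name} for name in NESTED_COLUMNS
                       ]
                   },
                   "granularitySpec": {"segmentGranularity": "day"},
               },
               "ioConfig": {
                   "type": "kafka",
                   "topic": TOPIC,
                   "inputFormat": {"type": "json"},
                   "consumerProperties": {"bootstrap.servers": BOOTSTRAP_SERVER},
                   # Read from the start of the stream, as the Tune step requires.
                   "useEarliestOffset": True,
               },
               "tuningConfig": {"type": "kafka"},
           },
       }


   spec_json = json.dumps(build_supervisor_spec(), indent=2)
   print(spec_json)
   ```

   Assuming the quickstart router is listening on `localhost:8888`, the printed JSON could then be posted to the `/druid/indexer/v1/supervisor` endpoint instead of clicking through the data loader.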



