fjy commented on a change in pull request #8544: Update Kafka loading docs to use the streaming data loader
URL: https://github.com/apache/incubator-druid/pull/8544#discussion_r324764184
##########
File path: docs/tutorials/tutorial-kafka.md
##########
@@ -54,17 +54,126 @@ Run this command to create a Kafka topic called *wikipedia*, to which we'll send
```bash
./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wikipedia
+```
+
+## Load data into Kafka
+
+Let's launch a producer for our topic and send some data!
+
+In your Druid directory, run the following command:
+
+```bash
+cd quickstart/tutorial
+gunzip -c wikiticker-2015-09-12-sampled.json.gz > wikiticker-2015-09-12-sampled.json
```
-## Start Druid Kafka ingestion
+In your Kafka directory, run the following command, where {PATH_TO_DRUID} is replaced by the path to the Druid directory:
+
+```bash
+export KAFKA_OPTS="-Dfile.encoding=UTF-8"
+./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia < {PATH_TO_DRUID}/quickstart/tutorial/wikiticker-2015-09-12-sampled.json
+```
+
+The previous command posted sample events to the *wikipedia* Kafka topic.
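+Each line of the sample file is a single JSON-encoded event. As a rough illustration (the field set shown here is illustrative, not the full schema of the sample data), a record looks something like:
+
+```json
+{
+  "time": "2015-09-12T00:47:00.496Z",
+  "channel": "#en.wikipedia",
+  "user": "GELongstreet",
+  "added": 36
+}
+```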
+Now we will use Druid's Kafka indexing service to ingest messages from our newly created topic.
+
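+Under the hood, the Kafka indexing service is driven by a supervisor spec. The data loader builds the full spec for you; as a minimal, abbreviated sketch, it looks roughly like:
+
+```json
+{
+  "type": "kafka",
+  "spec": {
+    "ioConfig": {
+      "type": "kafka",
+      "consumerProperties": { "bootstrap.servers": "localhost:9092" },
+      "topic": "wikipedia"
+    },
+    "dataSchema": { "dataSource": "wikipedia", "...": "..." },
+    "tuningConfig": { "type": "kafka" }
+  }
+}
+```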
+## Loading data with the data loader
+
+Navigate to [localhost:8888](http://localhost:8888) and click `Load data` in the console header.
+
+
+
+Select `Apache Kafka` and click `Connect data`.
+
+
+
+Enter `localhost:9092` as the bootstrap server and `wikipedia` as the topic.
+
+Click `Preview` and make sure that the data you are seeing is correct.
+
+Once the data is located, you can click "Next: Parse data" to go to the next step.
+
+
+
+The data loader will try to automatically determine the correct parser for the data.
+In this case it will successfully determine `json`.
+Feel free to play around with different parser options to get a preview of how Druid will parse your data.
+
+With the `json` parser selected, click `Next: Parse time` to get to the step centered around determining your primary timestamp column.
+
+
+
+Druid's architecture requires a primary timestamp column (internally stored in a column called `__time`).
+If you do not have a timestamp in your data, select `Constant value`.
+In our example, the data loader will determine that the `time` column in our raw data is the only candidate that can be used as the primary time column.
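+In the generated ingestion spec, this choice becomes a `timestampSpec`. Assuming the `time` column holds ISO 8601 timestamps (as the sample data does), the relevant fragment would look roughly like:
+
+```json
+"timestampSpec": {
+  "column": "time",
+  "format": "iso"
+}
+```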
+
+Click `Next: ...` twice to go past the `Transform` and `Filter` steps.
+You do not need to enter anything in these steps, as applying ingestion-time transforms and filters is out of scope for this tutorial.
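+For reference only (not needed for this tutorial), ingestion-time transforms and filters live in the spec's `transformSpec`. A hypothetical example that derives an uppercased channel column and keeps only English-channel events might look like:
+
+```json
+"transformSpec": {
+  "transforms": [
+    { "type": "expression", "name": "channelUpper", "expression": "upper(channel)" }
+  ],
+  "filter": { "type": "selector", "dimension": "channel", "value": "#en.wikipedia" }
+}
+```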
+
+
+
+In the `Configure schema` step, you can configure which dimensions (and metrics) will be ingested into Druid.
Review comment:
I would provide a link explaining what dimensions/metrics mean and how to fine-tune your data schema.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]