[GitHub] jon-wei commented on a change in pull request #6126: New quickstart and tutorials

GitBox Thu, 09 Aug 2018 00:33:59 -0700

URL: https://github.com/apache/incubator-druid/pull/6126#discussion_r208831377


 ##########
 File path: docs/content/tutorials/tutorial-kafka.md
 ##########
 @@ -37,149 +30,56 @@ Start a Kafka broker by running the following command in a new terminal:
 ./bin/kafka-server-start.sh config/server.properties
 ```
 
-Run this command to create a Kafka topic called *metrics*, to which we'll send data:
+Run this command to create a Kafka topic called *wikipedia*, to which we'll send data:
 
 ```bash
-./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic metrics
+./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wikipedia
 ```
 
-## Send example data
-
-Let's launch a console producer for our topic and send some data!
-
-In your Druid directory, generate some metrics by running:
-
-```bash
-bin/generate-example-metrics
-```
+## Enable Druid Kafka ingestion
 
-In your Kafka directory, run:
+We will use Druid's Kafka indexing service to ingest messages from our newly created *wikipedia* topic. To start the
+service, we will need to submit a supervisor spec to the Druid overlord by running the following from the Druid directory:
 
 ```bash
-./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic metrics
+curl -XPOST -H'Content-Type: application/json' -d @quickstart/tutorial/wikipedia-kafka-supervisor.json http://localhost:8090/druid/indexer/v1/supervisor
 ```
 
-The *kafka-console-producer* command is now awaiting input. Copy the generated example metrics,
-paste them into the *kafka-console-producer* terminal, and press enter. If you like, you can also
-paste more messages into the producer, or you can press CTRL-D to exit the console producer.
-
-You can immediately query this data, or you can skip ahead to the
-[Loading your own data](#loading-your-own-data) section if you'd like to load your own dataset.
-
-## Querying your data
-
-After sending data, you can immediately query it using any of the
-[supported query methods](../querying/querying.html).
-
-## Loading your own data
-
-So far, you've loaded data into Druid from Kafka using an ingestion spec that we've included in the
-distribution. Each ingestion spec is designed to work with a particular dataset. You load your own
-data types into Imply by writing a custom ingestion spec.
+If the supervisor was successfully created, you will get a response containing the ID of the supervisor; in our case we should see `{"id":"wikipedia-kafka"}`.
 
-You can write a custom ingestion spec by starting from the bundled configuration in
-`conf-quickstart/tranquility/kafka.json` and modifying it for your own needs.
+For more details about what's going on here, check out the
+[Druid Kafka indexing service documentation](http://druid.io/docs/{{druidVersion}}/development/extensions-core/kafka-ingestion.html).
 
-The most important questions are:
+## Load data
 
-  * What should the dataset be called? This is the "dataSource" field of the "dataSchema".
-  * Which field should be treated as a timestamp? This belongs in the "column" of the "timestampSpec".
-  * Which fields should be treated as dimensions? This belongs in the "dimensions" of the "dimensionsSpec".
-  * Which fields should be treated as measures? This belongs in the "metricsSpec".
+Let's launch a console producer for our topic and send some data!
 
-Let's use a small JSON pageviews dataset in the topic *pageviews* as an example, with records like:
+In your Druid directory, run the following command:
 
-```json
-{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32}
 ```
-
-First, create the topic:
-
-```bash
-./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic pageviews
+cd quickstart
+gunzip -k wikipedia-2015-09-12-sampled.json.gz
 ```
 
-Next, edit `conf-quickstart/tranquility/kafka.json`:
-
-  * Let's call the dataset "pageviews-kafka".
-  * The timestamp is the "time" field.
-  * Good choices for dimensions are the string fields "url" and "user".
-  * Good choices for measures are a count of pageviews, and the sum of "latencyMs". Collecting that
-sum when we load the data will allow us to compute an average at query time as well.
-
-You can edit the existing `conf-quickstart/tranquility/kafka.json` file by altering these
-sections:
-
-  1. Change the key `"metrics-kafka"` under `"dataSources"` to `"pageviews-kafka"`
-  2. Alter these sections under the new `"pageviews-kafka"` key:
-  ```json
-  "dataSource": "pageviews-kafka"
-  ```
-
-  ```json
-  "timestampSpec": {
-       "format": "auto",
-       "column": "time"
-  }
-  ```
-
-  ```json
-  "dimensionsSpec": {
-       "dimensions": ["url", "user"]
-  }
-  ```
-
-  ```json
-  "metricsSpec": [
-       {"name": "views", "type": "count"},
-       {"name": "latencyMs", "type": "doubleSum", "fieldName": "latencyMs"}
-  ]
-  ```
-
-  ```json
-  "properties" : {
-       "task.partitions" : "1",
-       "task.replicants" : "1",
-       "topicPattern" : "pageviews"
-  }
-  ```
-
-Next, start Druid Kafka ingestion:
+In your Kafka directory, run the following command, where {PATH_TO_DRUID} is replaced by the path to the Druid directory:
 
 ```bash
-bin/tranquility kafka -configFile ../druid-#{DRUIDVERSION}/conf-quickstart/tranquility/kafka.json
+export KAFKA_OPTS="-Dfile.encoding=UTF-8"
+./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia < {PATH_TO_DRUID}/quickstart/wikipedia-2015-09-12-sampled.json
 ```
 
-- If your Tranquility server or Kafka is already running, stop it (CTRL-C) and
-start it up again.
-
-Finally, send some data to the Kafka topic. Let's start with these messages:
-
-```json
-{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32}
-{"time": "2000-01-01T00:00:00Z", "url": "/", "user": "bob", "latencyMs": 11}
-{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}
-```
+The previous command posted sample events to the *wikipedia* Kafka topic, which were then ingested into Druid by the Kafka indexing service. You're now ready to run some queries!
 
-Druid streaming ingestion requires relatively current messages (relative to a slack time controlled by the
-[windowPeriod](../ingestion/stream-ingestion.html#segmentgranularity-and-windowperiod) value), so you should
-replace `2000-01-01T00:00:00Z` in these messages with the current time in ISO8601 format. You can
-get this by running:
+## Querying your data
 
-```bash
-python -c 'import datetime; print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))'
-```
+After data is sent to the Kafka stream, it is immediately available for querying.
 
-Update the timestamps in the JSON above, then copy and paste these messages into this console
-producer and press enter:
+Please follow the [query tutorial](../tutorial/tutorial-query.html) to run some example queries on the newly loaded data.
 
-```bash
-./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic pageviews
-```
+## Cleanup
 
-That's it, your data should now be in Druid. You can immediately query it using any of the
-[supported query methods](../querying/querying.html).
+If you wish to go through any of the other ingestion tutorials, you will need to shut down the cluster and reset the cluster state by removing the contents of the `var` directory under the Druid package, as the other tutorials will write to the same "wikipedia" datasource.
 
 ## Further reading
 
-To read more about loading streams, see our [streaming ingestion documentation](../ingestion/stream-ingestion.html).
+For more information on loading data from Kafka streams, please see the [Druid Kafka indexing service documentation](http://druid.io/docs/{{druidVersion}}/development/extensions-core/kafka-ingestion.html).
 
 Review comment:
   Fixed
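
   For readers who want to script the supervisor step from the diff above, here is a minimal sketch of how the `curl` call and the check for `{"id":"wikipedia-kafka"}` might be wrapped. The function names `extract_supervisor_id` and `submit_supervisor` are hypothetical helpers, not part of Druid or this pull request, and the `sed` pattern assumes the single-field response shown in the tutorial rather than general JSON.

```bash
#!/usr/bin/env bash
# Sketch only: both helper names are hypothetical and not part of Druid.

# Extract the "id" value from a response such as {"id":"wikipedia-kafka"}.
# Uses sed so no JSON tooling is needed; adequate only for this flat reply.
extract_supervisor_id() {
  printf '%s' "$1" | sed -n 's/.*"id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Submit a supervisor spec and check that the returned ID is the expected one.
# Arguments: spec file, overlord supervisor URL, expected supervisor ID.
submit_supervisor() {
  local spec="$1" url="$2" expected="$3" response id
  response=$(curl -s -XPOST -H 'Content-Type: application/json' \
    -d "@${spec}" "${url}") || return 1
  id=$(extract_supervisor_id "${response}")
  if [ "${id}" = "${expected}" ]; then
    echo "supervisor created: ${id}"
  else
    echo "unexpected response: ${response}" >&2
    return 1
  fi
}
```

   With the values from the tutorial, this would be called as `submit_supervisor quickstart/tutorial/wikipedia-kafka-supervisor.json http://localhost:8090/druid/indexer/v1/supervisor wikipedia-kafka`, assuming the overlord is listening on port 8090 as in the diff.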

