sthetland commented on a change in pull request #9766:
URL: https://github.com/apache/druid/pull/9766#discussion_r416053995



##########
File path: docs/tutorials/index.md
##########
@@ -99,96 +91,173 @@ $ ./bin/start-micro-quickstart
 [Fri May  3 11:40:50 2019] Running command[middleManager], logging 
to[/apache-druid-{{DRUIDVERSION}}/var/sv/middleManager.log]: bin/run-druid 
middleManager conf/druid/single-server/micro-quickstart
 ```
 
-All persistent state such as the cluster metadata store and segments for the 
services will be kept in the `var` directory under the 
apache-druid-{{DRUIDVERSION}} package root. Logs for the services are located 
at `var/sv`.
+All persistent state, such as the cluster metadata store and segments for the services, is kept in the `var` directory under 
+the Druid root directory, apache-druid-{{DRUIDVERSION}}. Each service writes 
to a log file under `var/sv`, as noted in the startup script output above.
+
+At any time, you can revert Druid to its original, post-installation state by 
deleting the entire `var` directory. You may
+want to do this, for example, between Druid tutorials or after 
experimentation, to start with a fresh instance. 
+
+To stop Druid at any time, press CTRL-C in the terminal. This exits the `bin/start-micro-quickstart` script and 
+terminates all Druid processes. 
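
The reset described above can be sketched as a short shell sequence. The stand-in directory below is hypothetical so the snippet runs without a real Druid install; in practice you would work inside apache-druid-{{DRUIDVERSION}} and delete its `var` directory after stopping the cluster:

```shell
# Stand-in for the apache-druid-{{DRUIDVERSION}} install root (hypothetical path).
DRUID_HOME="./druid-demo"
mkdir -p "$DRUID_HOME/var/sv"        # simulate state left behind by a prior run
# The reset itself: with the cluster stopped, delete everything under var/.
rm -rf "$DRUID_HOME/var"
[ -d "$DRUID_HOME/var" ] || echo "state cleared; rerun bin/start-micro-quickstart"
```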
+
 
-Later on, if you'd like to stop the services, CTRL-C to exit the 
`bin/start-micro-quickstart` script, which will terminate the Druid processes.
+## Step 3. Open the Druid console 
 
-Once the cluster has started, you can navigate to 
[http://localhost:8888](http://localhost:8888).
-The [Druid router process](../design/router.md), which serves the [Druid 
console](../operations/druid-console.md), resides at this address.
+After the Druid services finish startup, open the [Druid 
console](../operations/druid-console.md) at 
[http://localhost:8888](http://localhost:8888). 
 
 ![Druid console](../assets/tutorial-quickstart-01.png "Druid console")
 
-It takes a few seconds for all the Druid processes to fully start up. If you 
open the console immediately after starting the services, you may see some 
errors that you can safely ignore.
-
-
-## Loading data
-
-### Tutorial dataset
-
-For the following data loading tutorials, we have included a sample data file 
containing Wikipedia page edit events that occurred on 2015-09-12.
-
-This sample data is located at 
`quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz` from the Druid 
package root.
-The page edit events are stored as JSON objects in a text file.
-
-The sample data has the following columns, and an example event is shown below:
-
-  * added
-  * channel
-  * cityName
-  * comment
-  * countryIsoCode
-  * countryName
-  * deleted
-  * delta
-  * isAnonymous
-  * isMinor
-  * isNew
-  * isRobot
-  * isUnpatrolled
-  * metroCode
-  * namespace
-  * page
-  * regionIsoCode
-  * regionName
-  * user
-
-```json
-{
-  "timestamp":"2015-09-12T20:03:45.018Z",
-  "channel":"#en.wikipedia",
-  "namespace":"Main",
-  "page":"Spider-Man's powers and equipment",
-  "user":"foobar",
-  "comment":"/* Artificial web-shooters */",
-  "cityName":"New York",
-  "regionName":"New York",
-  "regionIsoCode":"NY",
-  "countryName":"United States",
-  "countryIsoCode":"US",
-  "isAnonymous":false,
-  "isNew":false,
-  "isMinor":false,
-  "isRobot":false,
-  "isUnpatrolled":false,
-  "added":99,
-  "delta":99,
-  "deleted":0,
-}
-```
+It may take a few seconds for all Druid services to finish starting, including 
the [Druid router](../design/router.md), which serves the console. If you 
attempt to open the Druid console before startup is complete, you may see 
errors in the browser. Wait a few moments and try again. 
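
If you prefer a scriptable readiness check over refreshing the browser, Druid processes expose a `/status/health` endpoint; probing the router on the default quickstart port is one option (a sketch, with the endpoint and port assumed from the default configuration):

```shell
# Probe the router's health endpoint (default quickstart port 8888).
# Prints "unreachable" while the router is still starting, or not running at all.
STATUS=$(curl -fsS --max-time 2 http://localhost:8888/status/health 2>/dev/null || echo "unreachable")
echo "router health: $STATUS"
```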
 
 
-### Data loading tutorials
+## Step 4. Load data
 
-The following tutorials demonstrate various methods of loading data into 
Druid, including both batch and streaming use cases.
-All tutorials assume that you are using the `micro-quickstart` single-machine 
configuration mentioned above.
 
-- [Loading a file](./tutorial-batch.md) - this tutorial demonstrates how to 
perform a batch file load, using Druid's native batch ingestion.
-- [Loading stream data from Apache Kafka](./tutorial-kafka.md) - this tutorial 
demonstrates how to load streaming data from a Kafka topic.
-- [Loading a file using Apache Hadoop](./tutorial-batch-hadoop.md) - this 
tutorial demonstrates how to perform a batch file load, using a remote Hadoop 
cluster.
-- [Writing your own ingestion spec](./tutorial-ingestion-spec.md) - this 
tutorial demonstrates how to write a new ingestion spec and use it to load data.
+Ingestion specs define the schema of the data Druid reads and stores. You can write ingestion specs by hand or generate them 
+using the _data loader_, as we will do here. 
 
-### Resetting cluster state
+For this tutorial, we'll load sample data bundled with Druid that represents 
Wikipedia page edits on a given day. 
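
For reference, the spec that the data loader builds over the following steps is an ordinary JSON document that you could also write by hand and submit yourself. A rough sketch for this tutorial's file and settings (field values are illustrative, based on our reading of the native batch spec format rather than actual loader output):

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "quickstart/tutorial/",
        "filter": "wikiticker-2015-09-12-sampled.json.gz"
      },
      "inputFormat": { "type": "json" }
    },
    "dataSchema": {
      "dataSource": "wikiticker-2015-09-12-sampled",
      "timestampSpec": { "column": "time", "format": "iso" },
      "dimensionsSpec": {},
      "granularitySpec": { "segmentGranularity": "day", "rollup": false }
    },
    "tuningConfig": { "type": "index_parallel" }
  }
}
```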
 
-If you want a clean start after stopping the services, delete the `var` 
directory and run the `bin/start-micro-quickstart` script again.
+1. Click **Load data** from the Druid console header (![Load 
data](../assets/tutorial-batch-data-loader-00.png)).
 
-Once every service has started, you are now ready to load data.
+2. Select the **Local disk** tile and then click **Connect data**.
 
-#### Resetting Kafka
+   ![Data loader init](../assets/tutorial-batch-data-loader-01.png "Data 
loader init")
+
+3. Enter the following values: 
+
+   - **Base directory**: `quickstart/tutorial/`
+
+   - **File filter**: `wikiticker-2015-09-12-sampled.json.gz` 
+
+   ![Data location](../assets/tutorial-batch-data-loader-015.png "Data 
location")
+
+   Entering the base directory and [wildcard file 
filter](https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter.html)
 separately, as afforded by the UI, allows you to specify multiple files for 
ingestion at once.
+
+4. Click **Apply**. 
+
+   The data loader displays the raw data, giving you a chance to verify that 
the data 
+   appears as expected. 
+
+   ![Data loader sample](../assets/tutorial-batch-data-loader-02.png "Data 
loader sample")
+
+   Notice that your position in the sequence of steps to load data, 
**Connect** in our case, appears at the top of the console, as shown below. 
+   You can click other steps to move forward or backward in the sequence at 
any time.
+   
+   ![Load data](../assets/tutorial-batch-data-loader-12.png)  
+   
+
+5. Click **Next: Parse data**. 
+
+   The data loader tries to determine the parser appropriate for the data 
format automatically. In this case 
+   it identifies the data format as `json`, as shown in the **Input format** 
field at the bottom right.
+
+   ![Data loader parse data](../assets/tutorial-batch-data-loader-03.png "Data 
loader parse data")
+
+   Feel free to select other **Input format** options to get a sense of their 
configuration settings 
+   and how Druid parses other types of data.  
+
+6. With the JSON parser selected, click **Next: Parse time**. The **Parse 
time** settings are where you view and adjust the 
+   primary timestamp column for the data.
+
+   ![Data loader parse time](../assets/tutorial-batch-data-loader-04.png "Data 
loader parse time")
+
+   Druid requires data to have a primary timestamp column (internally stored 
in a column called `__time`).
+   If you do not have a timestamp in your data, select `Constant value`. In 
our example, the data loader 
+   determines that the `time` column is the only candidate that can be used as 
the primary time column.
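+
+   In spec terms, these settings populate the `timestampSpec` of the ingestion spec. A sketch for this dataset (the `missingValue` field is our assumption for how the loader's `Constant value` option is expressed):
+
+   ```json
+   {
+     "timestampSpec": {
+       "column": "time",
+       "format": "iso",
+       "missingValue": "2015-09-12T00:00:00Z"
+     }
+   }
+   ```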
+
+7. Click **Next: Transform**, **Next: Filter**, and then **Next: Configure 
schema**, skipping a few steps.
+
+   You do not need to adjust transformation or filtering settings; applying ingestion-time transforms and 
+   filters is out of scope for this tutorial.
+
+8. The **Configure schema** settings are where you specify which [dimensions](../ingestion/index.md#dimensions) 
+   and [metrics](../ingestion/index.md#metrics) are ingested. The outcome of this configuration determines exactly how the 
+   data will appear in Druid after ingestion. 
+
+   Since our dataset is very small, you can turn off 
[rollup](../ingestion/index.md#rollup) 
+   by unsetting the **Rollup** switch and confirming the change when prompted.
+
+   ![Data loader schema](../assets/tutorial-batch-data-loader-05.png "Data 
loader schema")
+
+
+9. Click **Next: Partition** to configure how the data will be split into segments. In this case, choose `DAY` as 
+   the **Segment Granularity**. 
+
+    ![Data loader partition](../assets/tutorial-batch-data-loader-06.png "Data 
loader partition")
+
+    Since this is a small dataset, we can have just a single segment, which is 
what selecting `DAY` as the 
+    segment granularity gives us. 
+
+10. Click **Next: Tune** and **Next: Publish**.
+
+11. The **Publish** settings are where you specify the datasource name in Druid. Change the default from `wikiticker-2015-09-12-sampled` 

Review comment:
       Done. Alternatively, I wondered if we should call it something unique, 
like wikipedia-batchfile, or something, to allow the tutorial datasources to 
live together without name collisions. Perhaps for later...




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
