[GitHub] jon-wei commented on a change in pull request #6126: New quickstart and tutorials

GitBox Thu, 09 Aug 2018 00:31:21 -0700

jon-wei commented on a change in pull request #6126: New quickstart and tutorials
URL: https://github.com/apache/incubator-druid/pull/6126#discussion_r208830812


 ##########
 File path: docs/content/tutorials/tutorial-batch.md
 ##########
 @@ -2,137 +2,153 @@
 layout: doc_page
 ---
 
-# Tutorial: Load your own batch data
+# Tutorial: Loading a file
 
 ## Getting started
 
-This tutorial shows you how to load your own data files into Druid.
+This tutorial demonstrates how to perform a batch file load, using Druid's native batch ingestion.
 
 For this tutorial, we'll assume you've already downloaded Druid as described in 
-the [single-machine quickstart](quickstart.html) and have it running on your local machine. You 
+the [single-machine quickstart](index.html) and have it running on your local machine. You 
 don't need to have loaded any data yet.
 
-Once that's complete, you can load your own dataset by writing a custom ingestion spec.
+## Preparing the data and the ingestion task spec
 
-## Writing an ingestion spec
+A data load is initiated by submitting an *ingestion task* spec to the Druid overlord. For this tutorial, we'll be loading the sample Wikipedia page edits data.
 
-When loading files into Druid, you will use Druid's [batch loading](../ingestion/batch-ingestion.html) process.
-There's an example batch ingestion spec in `quickstart/wikiticker-index.json` that you can modify 
-for your own needs.
+The Druid package includes the following sample native batch ingestion task spec at `quickstart/wikipedia-index.json`, shown here for convenience,
+which has been configured to read the `quickstart/wikiticker-2015-09-12-sampled.json.gz` input file:
 
-The most important questions are:
-
-  * What should the dataset be called? This is the "dataSource" field of the "dataSchema".
-  * Where is the dataset located? The file paths belong in the "paths" of the "inputSpec". If you 
-want to load multiple files, you can provide them as a comma-separated string.
-  * Which field should be treated as a timestamp? This belongs in the "column" of the "timestampSpec".
-  * Which fields should be treated as dimensions? This belongs in the "dimensions" of the "dimensionsSpec".
-  * Which fields should be treated as metrics? This belongs in the "metricsSpec".
-  * What time ranges (intervals) are being loaded? This belongs in the "intervals" of the "granularitySpec".
-
-If your data does not have a natural sense of time, you can tag each row with the current time. 
-You can also tag all rows with a fixed timestamp, like "2000-01-01T00:00:00.000Z".
-
-Let's use this pageviews dataset as an example. Druid supports TSV, CSV, and JSON out of the box. 
-Note that nested JSON objects are not supported, so if you do use JSON, you should provide a file 
-containing flattened objects.
-
-```json
-{"time": "2015-09-01T00:00:00Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32}
-{"time": "2015-09-01T01:00:00Z", "url": "/", "user": "bob", "latencyMs": 11}
-{"time": "2015-09-01T01:30:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}
 ```
-
-Make sure the file has no newline at the end. If you save this to a file called "pageviews.json", then for this dataset:
-
-  * Let's call the dataset "pageviews".
-  * The data is located in "pageviews.json".
-  * The timestamp is the "time" field.
-  * Good choices for dimensions are the string fields "url" and "user".
-  * Good choices for metrics are a count of pageviews, and the sum of "latencyMs". Collecting that 
-sum when we load the data will allow us to compute an average at query time as well.
-  * The data covers the time range 2015-09-01 (inclusive) through 2015-09-02 (exclusive).
-
-You can copy the existing `quickstart/wikiticker-index.json` indexing task to a new file:
-
-```bash
-cp quickstart/wikiticker-index.json my-index-task.json
+{
+  "type" : "index",
+  "spec" : {
+    "dataSchema" : {
+      "dataSource" : "wikipedia",
+      "parser" : {
+        "type" : "string",
+        "parseSpec" : {
+          "format" : "json",
+          "dimensionsSpec" : {
+            "dimensions" : [
+              "channel",
+              "cityName",
+              "comment",
+              "countryIsoCode",
+              "countryName",
+              "isAnonymous",
+              "isMinor",
+              "isNew",
+              "isRobot",
+              "isUnpatrolled",
+              "metroCode",
+              "namespace",
+              "page",
+              "regionIsoCode",
+              "regionName",
+              "user",
+              { "name" : "commentLength", "type" : "long" },
+              { "name" : "deltaBucket", "type" : "long" },
+              "flags",
+              "diffUrl",
+              { "name": "added", "type": "long" },
+              { "name": "deleted", "type": "long" },
+              { "name": "delta", "type": "long" }
+            ]
+          },
+          "timestampSpec": {
+            "column": "time",
+            "format": "iso"
+          }
+        }
+      },
+      "metricsSpec" : [],
+      "granularitySpec" : {
+        "type" : "uniform",
+        "segmentGranularity" : "day",
+        "queryGranularity" : "none",
+        "intervals" : ["2015-09-12/2015-09-13"],
+        "rollup" : false
+      }
+    },
+    "ioConfig" : {
+      "type" : "index",
+      "firehose" : {
+        "type" : "local",
+        "baseDir" : "quickstart/",
+        "filter" : "wikiticker-2015-09-12-sampled.json.gz"
+      },
+      "appendToExisting" : false
+    },
+    "tuningConfig" : {
+      "type" : "index",
+      "targetPartitionSize" : 5000000,
+      "maxRowsInMemory" : 25000,
+      "forceExtendableShardSpecs" : true
+    }
+  }
+}
 ```
 
-And modify it by altering these sections:
+This spec will create a datasource named "wikipedia", 
 
-```json
-"dataSource": "pageviews"
-```
+## Load batch data
 
-```json
-"inputSpec": {
-  "type": "static",
-  "paths": "pageviews.json"
-}
-```
+We've included a sample of Wikipedia edits from June 27, 2016 to get you started.
 
 Review comment:
   Hm, good catch. I removed some of the extra columns that are only in the newer set.
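
   A mismatch like this (dimensions declared in the spec that the sample file does not contain) could be checked mechanically. The following is only a minimal sketch, not part of the PR: the function names are hypothetical, and it assumes the spec follows the `spec.dataSchema.parser.parseSpec.dimensionsSpec.dimensions` layout shown in the hunk above and that the input file is gzipped, newline-delimited JSON.

```python
import gzip
import json


def dimension_names(spec):
    """Return the dimension names declared in a native batch ingestion spec."""
    parse_spec = spec["spec"]["dataSchema"]["parser"]["parseSpec"]
    dims = parse_spec["dimensionsSpec"]["dimensions"]
    # A dimension is either a bare string or an object with a "name" key.
    return [d if isinstance(d, str) else d["name"] for d in dims]


def fields_in_rows(lines):
    """Collect every top-level key seen across newline-delimited JSON rows."""
    seen = set()
    for line in lines:
        line = line.strip()
        if line:
            seen.update(json.loads(line).keys())
    return seen


def missing_dimensions(spec, lines):
    """List declared dimensions that never appear in any input row."""
    present = fields_in_rows(lines)
    return [name for name in dimension_names(spec) if name not in present]


def check_files(spec_path, data_path):
    """Hypothetical wrapper: run the check against files on disk."""
    with open(spec_path) as f:
        spec = json.load(f)
    with gzip.open(data_path, "rt", encoding="utf-8") as rows:
        return missing_dimensions(spec, rows)
```

   Under those assumptions, calling `check_files("quickstart/wikipedia-index.json", "quickstart/wikiticker-2015-09-12-sampled.json.gz")` from the Druid package root would return the declared dimensions that no row in the sample file provides.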

