[GitHub] jon-wei commented on a change in pull request #6126: New quickstart and tutorials

GitBox Thu, 09 Aug 2018 00:34:18 -0700

jon-wei commented on a change in pull request #6126: New quickstart and 
tutorials
URL: https://github.com/apache/incubator-druid/pull/6126#discussion_r208831341


 ##########
 File path: docs/content/tutorials/tutorial-ingestion-spec.md
 ##########
 @@ -0,0 +1,641 @@
+---
+layout: doc_page
+---
+
+# Tutorial: Writing an ingestion spec
+
+This tutorial will guide the reader through the process of defining an 
ingestion spec, pointing out key considerations and guidelines.
+
+For this tutorial, we'll assume you've already downloaded Druid as described 
in 
+the [single-machine quickstart](index.html) and have it running on your local 
machine. 
+
+It will also be helpful to have finished [Tutorial: Loading a 
file](/docs/VERSION/tutorials/tutorial-batch.html), [Tutorial: Querying 
data](/docs/VERSION/tutorials/tutorial-query.html), and [Tutorial: 
Rollup](/docs/VERSION/tutorials/tutorial-rollup.html).
+
+## Example data
+
+Suppose we have the following network flow data:
+
+* `srcIP`: IP address of sender
+* `srcPort`: Port of sender
+* `dstIP`: IP address of receiver
+* `dstPort`: Port of receiver
+* `protocol`: IP protocol number
+* `packets`: number of packets transmitted
+* `bytes`: number of bytes transmitted
+* `cost`: the cost of sending the traffic
+
+```
+{"ts":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2", 
"srcPort":2000, "dstPort":3000, "protocol": 6, "packets":10, "bytes":1000, 
"cost": 1.4}
+{"ts":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2", 
"srcPort":2000, "dstPort":3000, "protocol": 6, "packets":20, "bytes":2000, 
"cost": 3.1}
+{"ts":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2", 
"srcPort":2000, "dstPort":3000, "protocol": 6, "packets":30, "bytes":3000, 
"cost": 0.4}
+{"ts":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2", 
"srcPort":5000, "dstPort":7000, "protocol": 6, "packets":40, "bytes":4000, 
"cost": 7.9}
+{"ts":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2", 
"srcPort":5000, "dstPort":7000, "protocol": 6, "packets":50, "bytes":5000, 
"cost": 10.2}
+{"ts":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2", 
"srcPort":5000, "dstPort":7000, "protocol": 6, "packets":60, "bytes":6000, 
"cost": 4.3}
+{"ts":"2018-01-01T02:33:14Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8", 
"srcPort":4000, "dstPort":5000, "protocol": 17, "packets":100, "bytes":10000, 
"cost": 22.4}
+{"ts":"2018-01-01T02:33:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8", 
"srcPort":4000, "dstPort":5000, "protocol": 17, "packets":200, "bytes":20000, 
"cost": 34.5}
+{"ts":"2018-01-01T02:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8", 
"srcPort":4000, "dstPort":5000, "protocol": 17, "packets":300, "bytes":30000, 
"cost": 46.3}
+```
+
+Save the JSON contents above into a file called `ingestion-tutorial-data.json`.
+
+Let's walk through the process of defining an ingestion spec that can load 
this data. 
+
+For this tutorial, we will be using the native batch indexing task. When using 
other task types, some aspects of the ingestion spec will differ, and this 
tutorial will point out such areas.
+
+## Defining the schema
+
+The core element of a Druid ingestion spec is the `dataSchema`. The 
`dataSchema` defines how to parse input data into a set of columns that will be 
stored in Druid.
+
+Let's start with an empty `dataSchema` and add fields to it as we progress 
through the tutorial.
+
+Create a new file called `ingestion-tutorial-index.json` with the following 
contents:
+
+```json
+"dataSchema" : {}
+```
+
+We will be making successive edits to this ingestion spec as we progress 
through the tutorial.
+
+### Datasource name
+
+The datasource name is specified by the `dataSource` parameter in the 
`dataSchema.
+
+```json
+"dataSchema" : {
+  "dataSource" : "ingestion-tutorial",
+}
+```
+
+Let's call the tutorial datasource `ingestion-tutorial`.
+
+### Choose a parser
+
+A `dataSchema` has a `parser` field, which defines the parser that Druid will 
use to interpret the input data.
+
+Since our input data is represented as JSON strings, we'll use a `string` 
parser with `json` format:
+
+```
+"dataSchema" : {
+  "dataSource" : "ingestion-tutorial",
+  "parser" : {
+    "type" : "string",
+    "parseSpec" : {
+      "format" : "json"
+    }
+  }
+}
+```
+
+### Time column
+
+The `parser` needs to know how to extract the main timestamp field from the 
input data. When using a `json` type `parseSpec`, the timestamp is defined in a 
`timestampSpec`. 
+
+The timestamp column in our input data is named "ts", containing ISO 8601 
timestamps, so let's add a `timestampSpec` with that information to the 
`parseSpec`:
+
+```
+"dataSchema" : {
+  "dataSource" : "ingestion-tutorial",
+  "parser" : {
+    "type" : "string",
+    "parseSpec" : {
+      "format" : "json",
+      "timestampSpec" : {
+        "format" : "iso",
+        "column" : "ts"
+      }
+    }
+  }
+}
+```
+
+### Column types
+
+Now that we've defined the time column, let's look at definitions for other 
columns.
+
+Druid supports the following column types: String, Long, Float, Double. We 
will see how these are used in the following sections.
+
+Before we move on to how we define our other non-time columns, let's discuss 
`rollup` first.
+
+### Rollup
+
+When ingesting data, we must consider whether we wish to use rollup or not.
+
+* If rollup is enabled, we will need to separate the input columns into two 
categories, "dimensions" and "metrics". "Dimensions" are the grouping columns 
for rollup, while "metrics" are the columns that will be aggregated.
+
+* If rollup is disabled, then all columns are treated as "dimensions" and no 
pre-aggregation occurs.
+
+For this tutorial, let's enable rollup. This is specified with a 
`granularitySpec` on the `dataSchema`. 
+
+Note that the `granularitySpec` lies outside of the `parser`. We will revist 
the `parser` soon when we define our dimensions and metrics.
+
+```
+"dataSchema" : {
+  "dataSource" : "ingestion-tutorial",
+  "parser" : {
+    "type" : "string",
+    "parseSpec" : {
+      "format" : "json",
+      "timestampSpec" : {
+        "format" : "iso",
+        "column" : "ts"
+      }
+    }
+  },
+  "granularitySpec" : {
+    "rollup" : true
+  }
+}
+
+```
+
+#### Choosing dimensions and metrics
+
+For this example dataset, the following is a sensible split for "dimensions" 
and "metrics":
+
+* Dimensions: srcIP, srcPort, dstIP, dstPort, protocol
+* Metrics: packets, bytes, cost
+
+The dimensions here are a group of properties that identify a unidirectional 
flow of IP traffic, while the metrics represent facts about the IP traffic flow 
specified by a dimension grouping.
+
+Let's look at how to define these dimensions and metrics within the ingestion 
spec.
+
+#### Dimensions
+
+Dimensions are specified with a `dimensionsSpec` inside the `parseSpec`.
+
+```
+"dataSchema" : {
+  "dataSource" : "ingestion-tutorial",
+  "parser" : {
+    "type" : "string",
+    "parseSpec" : {
+      "format" : "json",
+      "timestampSpec" : {
+        "format" : "iso",
+        "column" : "ts"
+      },
+      "dimensionsSpec" : {
+        "dimensions": [
+          "srcIP",
+          { "name" : "srcPort", "type" : "long" },
+          { "name" : "dstIP", "type" : "string" },
+          { "name" : "dstPort", "type" : "long" },
+          { "name" : "protocol", "type" : "string" }
+        ]
+      }
+    }
+  },
+  "granularitySpec" : {
+    "rollup" : true
+  }
+}
+```
+
+Each dimension has a `name` and a `type`, where `type` can be "long", "float", 
"double", or "string".
+
+Note that `srcIP` is a "string" dimension; for string dimensions, it is enough 
to specify just a dimension name, since "string" is the default dimension type.
+
+Also note that `protocol` is a numeric value in the input data, but we are 
ingesting it as a "string" column; Druid will coerce the input longs to strings 
during ingestion.
+ 
+##### Strings vs. Numerics
+
+Should a numeric input be ingested as a numeric dimension or as a string 
dimension?
+
+Numeric dimensions have the following pros/cons relative to String dimensions:
+* Pros: Numeric representation can result in smaller column sizes on disk and 
lower processing overhead when reading values from the column
+* Cons: Numeric dimensions do not have indices, so filtering on them will 
often be slower than filtering on an equivalent String dimension (which has 
bitmap indices)
+
+#### Metrics
+
+Metrics are specified with a `metricsSpec` inside the `dataSchema`:
+
+```json
+"dataSchema" : {
+  "dataSource" : "ingestion-tutorial",
+  "parser" : {
+    "type" : "string",
+    "parseSpec" : {
+      "format" : "json",
+      "timestampSpec" : {
+        "format" : "iso",
+        "column" : "ts"
+      },
+      "dimensionsSpec" : {
+        "dimensions": [
+          "srcIP",
+          { "name" : "srcPort", "type" : "long" },
+          { "name" : "dstIP", "type" : "string" },
+          { "name" : "dstPort", "type" : "long" },
+          { "name" : "protocol", "type" : "string" }
+        ]
+      }   
+    }
+  },
+  "metricsSpec" : [
+    { "type" : "count", "name" : "count" },
+    { "type" : "longSum", "name" : "packets", "fieldName" : "packets" },
+    { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" },
+    { "type" : "doubleSum", "name" : "cost", "fieldName" : "cost" }
+  ],
+  "granularitySpec" : {
+    "rollup" : true
+  }
+}
+```
+
+When defining a metric, it is necessary to specify what type of aggregation 
should be performed on that column during rollup.
+
+Here we have defined long sum aggregations on the two long metric columns, 
`packets` and `bytes`, and a double sum aggregation for the `cost` column.
+
+Note that the `metricsSpec` is on a different nesting level than 
`dimensionSpec` or `parseSpec`; it belongs on the same nesting level as 
`parser` within the `dataSchema`.
+
+Note that we have also defined a `count` aggregator. The count aggregator will 
track how many rows in the original input data contributed to a "rolled up" row 
in the final ingested data.
+
+### No rollup
+
+If we were not using rollup, all columns would be specified in the 
`dimensionsSpec`, e.g.:
+
+```
+      "dimensionsSpec" : {
+        "dimensions": [
+          "srcIP",
+          { "name" : "srcPort", "type" : "long" },
+          { "name" : "dstIP", "type" : "string" },
+          { "name" : "dstPort", "type" : "long" },
+          { "name" : "protocol", "type" : "string" },
+          { "name" : "packets", "type" : "long" },
+          { "name" : "bytes", "type" : "long" },
+          { "name" : "srcPort", "type" : "double" }
+        ]
+      },
+```
+
+
+### Define granularities
+
+At this point, we are done defining the `parser` and `metricsSpec` within the 
`dataSchema` and we are almost done writing the ingestion spec.
+
+There are some additional properties we need to set in the `granularitySpec`:
+* Type of granularitySpec: `uniform` and `arbitrary` are the two supported 
types. For this tutorial, we will use a `uniform` granularity spec, where all 
segments have uniform interval sizes (for example, all segments cover an hour's 
worth of data).
+* The segment granularity: what size of time interval should a single segment 
contain data for? e.g., `DAY`, `WEEK`
+* The bucketing granularity of the timestamps in the time column (referred to 
as `queryGranularity`)
+
+#### Segment granularity
+
+Segment granularity is configured by the `segmentGranularity` property in the 
`granularitySpec`. For this tutorial, we'll create hourly segments:
+
+```
+"dataSchema" : {
+  "dataSource" : "ingestion-tutorial",
+  "parser" : {
+    "type" : "string",
+    "parseSpec" : {
+      "format" : "json",
+      "timestampSpec" : {
+        "format" : "iso",
+        "column" : "ts"
+      },
+      "dimensionsSpec" : {
+        "dimensions": [
+          "srcIP",
+          { "name" : "srcPort", "type" : "long" },
+          { "name" : "dstIP", "type" : "string" },
+          { "name" : "dstPort", "type" : "long" },
+          { "name" : "protocol", "type" : "string" }
+        ]
+      }      
+    }
+  },
+  "metricsSpec" : [
+    { "type" : "count", "name" : "count" },
+    { "type" : "longSum", "name" : "packets", "fieldName" : "packets" },
+    { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" },
+    { "type" : "doubleSum", "name" : "cost", "fieldName" : "cost" }
+  ],
+  "granularitySpec" : {
+    "type" : "uniform",
+    "segmentGranularity" : "HOUR",
+    "rollup" : true
+  }
+}
+```
+
+Our input data has events from two separate hours, so this task will generate 
two segments.
+
+#### Query granularity
+
+The query granularity is configured by the `queryGranularity` property in the 
`granularitySpec`. For this tutorial, let's use minute granularity:
+
+```
+"dataSchema" : {
+  "dataSource" : "ingestion-tutorial",
+  "parser" : {
+    "type" : "string",
+    "parseSpec" : {
+      "format" : "json",
+      "timestampSpec" : {
+        "format" : "iso",
+        "column" : "ts"
+      },
+      "dimensionsSpec" : {
+        "dimensions": [
+          "srcIP",
+          { "name" : "srcPort", "type" : "long" },
+          { "name" : "dstIP", "type" : "string" },
+          { "name" : "dstPort", "type" : "long" },
+          { "name" : "protocol", "type" : "string" }
+        ]
+      }      
+    }
+  },
+  "metricsSpec" : [
+    { "type" : "count", "name" : "count" },
+    { "type" : "longSum", "name" : "packets", "fieldName" : "packets" },
+    { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" },
+    { "type" : "doubleSum", "name" : "cost", "fieldName" : "cost" }
+  ],
+  "granularitySpec" : {
+    "type" : "uniform",
+    "segmentGranularity" : "HOUR",
+    "queryGranularity" : "MINUTE"
+    "rollup" : true
+  }
+}
+```
+
+To see the effect of the query granularity, let's look at this row from the 
raw input data:
+
+```
+{"ts":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2", 
"srcPort":5000, "dstPort":7000, "protocol": 6, "packets":60, "bytes":6000, 
"cost": 4.3}
+```
+
+When this row is ingested with minute queryGranularity, Druid will floor the 
row's timestamp to minute buckets:
+
+```
+{"ts":"2018-01-01T01:03:00Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2", 
"srcPort":5000, "dstPort":7000, "protocol": 6, "packets":60, "bytes":6000, 
"cost": 4.3}
+```
+
+#### Define an interval (batch only)
 
 Review comment:
   I left this as-is

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] jon-wei commented on a change in pull request #6126: New quickstart and tutorials

Reply via email to