Re: [PR] [docs] Updating Rollup tutorial (druid)

via GitHub Thu, 25 Jul 2024 13:36:12 -0700


edgar2020 commented on code in PR #16762:
URL: https://github.com/apache/druid/pull/16762#discussion_r1692084584



##########
docs/tutorials/tutorial-rollup.md:
##########
@@ -49,150 +52,98 @@ For this tutorial, we'll use a small sample of network 
flow event data, represen
 {"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", 
"dstIP":"8.8.8.8","packets":12,"bytes":2818}
 ```
 
-A file containing this sample input data is located at 
`quickstart/tutorial/rollup-data.json`.
-
-We'll ingest this data using the following ingestion task spec, located at 
`quickstart/tutorial/rollup-index.json`.
-
-```json
-{
-  "type" : "index_parallel",
-  "spec" : {
-    "dataSchema" : {
-      "dataSource" : "rollup-tutorial",
-      "dimensionsSpec" : {
-        "dimensions" : [
-          "srcIP",
-          "dstIP"
-        ]
-      },
-      "timestampSpec": {
-        "column": "timestamp",
-        "format": "iso"
-      },
-      "metricsSpec" : [
-        { "type" : "count", "name" : "count" },
-        { "type" : "longSum", "name" : "packets", "fieldName" : "packets" },
-        { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" }
-      ],
-      "granularitySpec" : {
-        "type" : "uniform",
-        "segmentGranularity" : "week",
-        "queryGranularity" : "minute",
-        "intervals" : ["2018-01-01/2018-01-03"],
-        "rollup" : true
-      }
-    },
-    "ioConfig" : {
-      "type" : "index_parallel",
-      "inputSource" : {
-        "type" : "local",
-        "baseDir" : "quickstart/tutorial",
-        "filter" : "rollup-data.json"
-      },
-      "inputFormat" : {
-        "type" : "json"
-      },
-      "appendToExisting" : false
-    },
-    "tuningConfig" : {
-      "type" : "index_parallel",
-      "partitionsSpec": {
-        "type": "dynamic"
-      },
-      "maxRowsInMemory" : 25000
-    }
-  }
-}
+Load the sample dataset using the [`INSERT 
INTO`](../multi-stage-query/reference.md/#insert) statement and the 
[`EXTERN`](../multi-stage-query/reference.md/#extern-function) function to 
ingest the data inline. In the [Druid web 
console](../operations/web-console.md), go to the **Query** view and run the 
following query:
+
+```sql
+INSERT INTO "rollup_tutorial"
+WITH "inline_data" AS (
+  SELECT *
+  FROM TABLE(EXTERN('{
+    "type":"inline",
+    
"data":"{\"timestamp\":\"2018-01-01T01:01:35Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":20,\"bytes\":9024}\n{\"timestamp\":\"2018-01-01T01:02:14Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-01T01:01:59Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":11,\"bytes\":5780}\n{\"timestamp\":\"2018-01-01T01:01:51Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":255,\"bytes\":21133}\n{\"timestamp\":\"2018-01-01T01:02:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":377,\"bytes\":359971}\n{\"timestamp\":\"2018-01-01T01:03:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":49,\"bytes\":10204}\n{\"timestamp\":\"2018-01-02T21:33:14Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-02T21:33:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":123,\"bytes\":93999}\n{\"timestamp\":\"2018-01-02T21:35:45Z\",\"srcIP\"
 :\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":12,\"bytes\":2818}"}', 
+    '{"type":"json"}')) 
+    EXTEND ("timestamp" VARCHAR, "srcIP" VARCHAR, "dstIP" VARCHAR, "packets" 
BIGINT, "bytes" BIGINT)
+)
+SELECT
+  FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time,
+  "srcIP",
+  "dstIP",
+  SUM("bytes") AS "bytes",
+  SUM("packets") AS "packets",
+  COUNT(*) AS "count"
+FROM "inline_data"
+GROUP BY 1, 2, 3
+PARTITIONED BY DAY
 ```
 
-Rollup has been enabled by setting `"rollup" : true` in the `granularitySpec`.
-
-Note that we have `srcIP` and `dstIP` defined as dimensions, a longSum metric 
is defined for the `packets` and `bytes` columns, and the `queryGranularity` 
has been defined as `minute`.
+In the query, you group by dimensions, `timestamp`, `srcIP`, and `dstIP`. Note 
that the query uses the `FLOOR` function to bucket rows based on MINUTE 
granularity.
+For the metrics, you apply aggregations to sum the `bytes` and `packets` 
columns and add a column that counts the number of rows that get rolled up.
 
-We will see how these definitions are used after we load this data.
+After the ingestion completes, you can query the data.
 
-## Load the example data
+## Query the example data
 
-From the apache-druid-{{DRUIDVERSION}} package root, run the following command:
+In the web console, open a new tab in the **Query** view. Run the following 
query to view the ingested data:
 
-```bash
-bin/post-index-task --file quickstart/tutorial/rollup-index.json --url 
http://localhost:8081
+```sql
+SELECT * FROM "rollup_tutorial"
 ```
 
-After the script completes, we will query the data.
+Returns the following:
 
-## Query the example data
+| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` |
+| -- | -- | -- | -- | -- | -- |
+| `2018-01-01T01:01:00.000Z` | `1.1.1.1` | `2.2.2.2` | `35,937` | `3` | `286` |
+| `2018-01-01T01:02:00.000Z` | `1.1.1.1` | `2.2.2.2` | `366,260` | `2` | `415` 
|
+| `2018-01-01T01:03:00.000Z` | `1.1.1.1` | `2.2.2.2` | `10,204` | `1` | `49` |
+| `2018-01-02T21:33:00.000Z` | `7.7.7.7` | `8.8.8.8` | `100,288` | `2` | `161` 
|
+| `2018-01-02T21:35:00.000Z` | `7.7.7.7` | `8.8.8.8` | `2,818` | `1` | `12` |
 
-Let's run `bin/dsql` and issue a `select * from "rollup-tutorial";` query to 
see what data was ingested.
-
-```bash
-$ bin/dsql
-Welcome to dsql, the command-line client for Druid SQL.
-Type "\h" for help.
-dsql> select * from "rollup-tutorial";
-┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
-│ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
-├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
-│ 2018-01-01T01:01:00.000Z │  35937 │     3 │ 2.2.2.2 │     286 │ 1.1.1.1 │
-│ 2018-01-01T01:02:00.000Z │ 366260 │     2 │ 2.2.2.2 │     415 │ 1.1.1.1 │
-│ 2018-01-01T01:03:00.000Z │  10204 │     1 │ 2.2.2.2 │      49 │ 1.1.1.1 │
-│ 2018-01-02T21:33:00.000Z │ 100288 │     2 │ 8.8.8.8 │     161 │ 7.7.7.7 │
-│ 2018-01-02T21:35:00.000Z │   2818 │     1 │ 8.8.8.8 │      12 │ 7.7.7.7 │
-└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
-Retrieved 5 rows in 1.18s.
-
-dsql>
-```
+Notice there are only six rows as opposed to the nine rows in the example 
data. The next section covers how ingestion with rollup accomplishes this.
+
+## View rollup in action
 
-Let's look at the three events in the original input data that occurred during 
`2018-01-01T01:01`:
+Consider the three events in the original input data that occur over the 
course of minute `2018-01-01T01:01`:
 
 ```json
 {"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", 
"dstIP":"2.2.2.2","packets":20,"bytes":9024}
 {"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", 
"dstIP":"2.2.2.2","packets":255,"bytes":21133}
 {"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", 
"dstIP":"2.2.2.2","packets":11,"bytes":5780}
 ```
 
-These three rows have been "rolled up" into the following row:
+Druid combines the three rows into one during rollup:
 
-```bash
-┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
-│ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
-├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
-│ 2018-01-01T01:01:00.000Z │  35937 │     3 │ 2.2.2.2 │     286 │ 1.1.1.1 │
-└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
-```
+| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` |
+| -- | -- | -- | -- | -- | -- |
+| `2018-01-01T01:01:00.000Z` | `1.1.1.1` | `2.2.2.2` | `35,937` | `3` | `286` |
 
-The input rows have been grouped by the timestamp and dimension columns 
`{timestamp, srcIP, dstIP}` with sum aggregations on the metric columns 
`packets` and `bytes`.
+Before the grouping occurs, the `FLOOR(TIME_PARSE("timestamp") TO MINUTE)` 
expression buckets (floors) the timestamp column of the original input by 
minute.
 
-Before the grouping occurs, the timestamps of the original input data are 
bucketed/floored by minute, due to the `"queryGranularity":"minute"` setting in 
the ingestion spec.
+The input rows are grouped because they have the same values for their 
dimension columns `{timestamp, srcIP, dstIP}`. The metric columns calculate the 
sum aggregation of the grouped rows for `packets` and `bytes`. The `count` 
metric shows how many rows from the original input data contributed to the 
final "rolled up" row.

Review Comment:
   "roll up" changed to rolled-up



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [docs] Updating Rollup tutorial (druid)

Reply via email to