ektravel commented on code in PR #16762:
URL: https://github.com/apache/druid/pull/16762#discussion_r1691847557
##########
docs/tutorials/tutorial-rollup.md:
##########
@@ -49,150 +52,98 @@ For this tutorial, we'll use a small sample of network flow event data, represen
{"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7","dstIP":"8.8.8.8","packets":12,"bytes":2818}
```
-A file containing this sample input data is located at `quickstart/tutorial/rollup-data.json`.
-
-We'll ingest this data using the following ingestion task spec, located at `quickstart/tutorial/rollup-index.json`.
-
-```json
-{
- "type" : "index_parallel",
- "spec" : {
- "dataSchema" : {
- "dataSource" : "rollup-tutorial",
- "dimensionsSpec" : {
- "dimensions" : [
- "srcIP",
- "dstIP"
- ]
- },
- "timestampSpec": {
- "column": "timestamp",
- "format": "iso"
- },
- "metricsSpec" : [
- { "type" : "count", "name" : "count" },
- { "type" : "longSum", "name" : "packets", "fieldName" : "packets" },
- { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" }
- ],
- "granularitySpec" : {
- "type" : "uniform",
- "segmentGranularity" : "week",
- "queryGranularity" : "minute",
- "intervals" : ["2018-01-01/2018-01-03"],
- "rollup" : true
- }
- },
- "ioConfig" : {
- "type" : "index_parallel",
- "inputSource" : {
- "type" : "local",
- "baseDir" : "quickstart/tutorial",
- "filter" : "rollup-data.json"
- },
- "inputFormat" : {
- "type" : "json"
- },
- "appendToExisting" : false
- },
- "tuningConfig" : {
- "type" : "index_parallel",
- "partitionsSpec": {
- "type": "dynamic"
- },
- "maxRowsInMemory" : 25000
- }
- }
-}
+Load the sample dataset using the [`INSERT INTO`](../multi-stage-query/reference.md#insert) statement and the [`EXTERN`](../multi-stage-query/reference.md#extern-function) function to ingest the data inline. In the [Druid web console](../operations/web-console.md), go to the **Query** view and run the following query:
+
+```sql
+INSERT INTO "rollup_tutorial"
+WITH "inline_data" AS (
+ SELECT *
+ FROM TABLE(EXTERN('{
+ "type":"inline",
+    "data":"{\"timestamp\":\"2018-01-01T01:01:35Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":20,\"bytes\":9024}\n{\"timestamp\":\"2018-01-01T01:02:14Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-01T01:01:59Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":11,\"bytes\":5780}\n{\"timestamp\":\"2018-01-01T01:01:51Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":255,\"bytes\":21133}\n{\"timestamp\":\"2018-01-01T01:02:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":377,\"bytes\":359971}\n{\"timestamp\":\"2018-01-01T01:03:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":49,\"bytes\":10204}\n{\"timestamp\":\"2018-01-02T21:33:14Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-02T21:33:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":123,\"bytes\":93999}\n{\"timestamp\":\"2018-01-02T21:35:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":12,\"bytes\":2818}"}',
+ '{"type":"json"}'))
+ EXTEND ("timestamp" VARCHAR, "srcIP" VARCHAR, "dstIP" VARCHAR, "packets" BIGINT, "bytes" BIGINT)
+)
+SELECT
+ FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time,
+ "srcIP",
+ "dstIP",
+ SUM("bytes") AS "bytes",
+ SUM("packets") AS "packets",
+ COUNT(*) AS "count"
+FROM "inline_data"
+GROUP BY 1, 2, 3
+PARTITIONED BY DAY
```
-Rollup has been enabled by setting `"rollup" : true` in the `granularitySpec`.
-
-Note that we have `srcIP` and `dstIP` defined as dimensions, a longSum metric is defined for the `packets` and `bytes` columns, and the `queryGranularity` has been defined as `minute`.
+In the query, you group by the dimensions `timestamp`, `srcIP`, and `dstIP`. Note that the query uses the `FLOOR` function to bucket rows at `MINUTE` granularity.
+For the metrics, you apply aggregations to sum the `bytes` and `packets` columns and add a column that counts the number of rows that get rolled up.
Review Comment:
```suggestion
For the metrics, you apply aggregations to sum the `bytes` and `packets` columns and add a column that counts the number of rolled-up rows.
```
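
As an aside, the effect of rollup in this tutorial can be checked by selecting from the new table once the `INSERT` above completes. This is only a sketch based on the column names defined in the quoted query:

```sql
SELECT
  __time,
  "srcIP",
  "dstIP",
  "bytes",
  "packets",
  "count"
FROM "rollup_tutorial"
ORDER BY __time
```

Input rows that share the same minute bucket, `srcIP`, and `dstIP` should come back as a single row, with `count` recording how many raw rows were combined.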
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]