vtlim commented on code in PR #16953:
URL: https://github.com/apache/druid/pull/16953#discussion_r1761952966
##########
docs/tutorials/tutorial-sketches-theta.md:
##########
@@ -95,103 +97,37 @@ date,uid,show,episode
## Ingest data using Theta sketches
-1. Navigate to the **Load data** wizard in the web console.
-2. Select `Paste data` as the data source and paste the given data:
-
-
-
-3. Leave the source type as `inline` and click **Apply** and **Next: Parse
data**.
-4. Parse the data as CSV, with included headers:
-
-
-
-5. Accept the default values in the **Parse time**, **Transform**, and
**Filter** stages.
-6. In the **Configure schema** stage, enable rollup and confirm your choice in
the dialog. Then set the query granularity to `day`.
-
-
-
-7. Add the Theta sketch during this stage. Select **Add metric**.
-8. Define the new metric as a Theta sketch with the following details:
- * **Name**: `theta_uid`
- * **Type**: `thetaSketch`
- * **Field name**: `uid`
- * **Size**: Accept the default value, `16384`.
- * **Is input theta sketch**: Accept the default value, `False`.
-
-
-
-9. Click **Apply** to add the new metric to the data model.
-
-
-10. You are not interested in individual user ID's, only the unique counts.
Right now, `uid` is still in the data model. To remove it, click on the `uid`
column in the data model and delete it using the trashcan icon on the right:
-
-
-
-11. For the remaining stages of the **Load data** wizard, set the following
options:
- * **Partition**: Set **Segment granularity** to `day`.
- * **Tune**: Leave the default options.
- * **Publish**: Set the datasource name to `ts_tutorial`.
-
-On the **Edit spec** page, your final input spec should match the following:
-
-```json
-{
- "type": "index_parallel",
- "spec": {
- "ioConfig": {
- "type": "index_parallel",
- "inputSource": {
- "type": "inline",
- "data": "date,uid,show,episode\n2022-05-19,alice,Game of
Thrones,S1E1\n2022-05-19,alice,Game of Thrones,S1E2\n2022-05-19,alice,Game of
Thrones,S1E1\n2022-05-19,bob,Bridgerton,S1E1\n2022-05-20,alice,Game of
Thrones,S1E1\n2022-05-20,carol,Bridgerton,S1E2\n2022-05-20,dan,Bridgerton,S1E1\n2022-05-21,alice,Game
of Thrones,S1E1\n2022-05-21,carol,Bridgerton,S1E1\n2022-05-21,erin,Game of
Thrones,S1E1\n2022-05-21,alice,Bridgerton,S1E1\n2022-05-22,bob,Game of
Thrones,S1E1\n2022-05-22,bob,Bridgerton,S1E1\n2022-05-22,carol,Bridgerton,S1E2\n2022-05-22,bob,Bridgerton,S1E1\n2022-05-22,erin,Game
of Thrones,S1E1\n2022-05-22,erin,Bridgerton,S1E2\n2022-05-23,erin,Game of
Thrones,S1E1\n2022-05-23,alice,Game of Thrones,S1E1"
- },
- "inputFormat": {
- "type": "csv",
- "findColumnsFromHeader": true
- }
- },
- "tuningConfig": {
- "type": "index_parallel",
- "partitionsSpec": {
- "type": "hashed"
- },
- "forceGuaranteedRollup": true
- },
- "dataSchema": {
- "dataSource": "ts_tutorial",
- "timestampSpec": {
- "column": "date",
- "format": "auto"
- },
- "dimensionsSpec": {
- "dimensions": [
- "show",
- "episode"
- ]
- },
- "granularitySpec": {
- "queryGranularity": "day",
- "rollup": true,
- "segmentGranularity": "day"
- },
- "metricsSpec": [
- {
- "name": "count",
- "type": "count"
- },
- {
- "type": "thetaSketch",
- "name": "theta_uid",
- "fieldName": "uid"
- }
- ]
- }
- }
-}
+Load the sample dataset using the [`INSERT
INTO`](../multi-stage-query/reference.md/#insert) statement and the
[`EXTERN`](../multi-stage-query/reference.md/#extern-function) function to
ingest the sample data inline. In the [Druid web
console](../operations/web-console.md), go to the **Query** view and run the
following query:
+
+
+```sql
+INSERT INTO "ts_tutorial"
+WITH "source" AS (SELECT * FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"date,uid,show,episode\n2022-05-19,alice,Game of
Thrones,S1E1\n2022-05-19,alice,Game of Thrones,S1E2\n2022-05-19,alice,Game of
Thrones,S1E1\n2022-05-19,bob,Bridgerton,S1E1\n2022-05-20,alice,Game of
Thrones,S1E1\n2022-05-20,carol,Bridgerton,S1E2\n2022-05-20,dan,Bridgerton,S1E1\n2022-05-21,alice,Game
of Thrones,S1E1\n2022-05-21,carol,Bridgerton,S1E1\n2022-05-21,erin,Game of
Thrones,S1E1\n2022-05-21,alice,Bridgerton,S1E1\n2022-05-22,bob,Game of
Thrones,S1E1\n2022-05-22,bob,Bridgerton,S1E1\n2022-05-22,carol,Bridgerton,S1E2\n2022-05-22,bob,Bridgerton,S1E1\n2022-05-22,erin,Game
of Thrones,S1E1\n2022-05-22,erin,Bridgerton,S1E2\n2022-05-23,erin,Game of
Thrones,S1E1\n2022-05-23,alice,Game of Thrones,S1E1"}',
+ '{"type":"csv","findColumnsFromHeader":true}'
+ )
+) EXTEND ("date" VARCHAR, "show" VARCHAR, "episode" VARCHAR, "uid" VARCHAR))
+SELECT
+ TIME_FLOOR(TIME_PARSE("date"), 'P1D') AS "__time",
+ "show",
+ "episode",
+ COUNT(*) AS "count",
+ DS_THETA("uid") AS "theta_uid"
+FROM "source"
+GROUP BY 1, 2, 3
+PARTITIONED BY DAY
```
-Notice the `theta_uid` object in the `metricsSpec` list, that defines the
`thetaSketch` aggregator on the `uid` column during ingestion.
+Notice that there is no `uid` in the `SELECT` statement.
+
+In this scenario you are not interested in individual user IDs, only the
unique counts.
+
+Instead you create Theta sketches on the values of `uid` using the `DS_THETA`
function.
Review Comment:
```suggestion
In this scenario you are not interested in individual user IDs, only the
unique counts.
Instead you create Theta sketches on the values of `uid` using the
`DS_THETA` function.
```
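For context, once the ingestion above completes, the unique counts could be read back from the sketch column with a query along these lines. This is only a sketch: it assumes the `ts_tutorial` datasource and `theta_uid` column from the diff above, and that the `druid-datasketches` extension is loaded so that `APPROX_COUNT_DISTINCT_DS_THETA` is available. The `users` alias is illustrative.

```sql
-- Estimate unique viewers per show and episode from the stored Theta sketches.
-- Assumes the "ts_tutorial" datasource created by the INSERT statement above.
SELECT
  "show",
  "episode",
  APPROX_COUNT_DISTINCT_DS_THETA("theta_uid") AS "users"
FROM "ts_tutorial"
GROUP BY 1, 2
```

Because `uid` itself is never stored, an estimate derived from `theta_uid` is the only way to recover these counts, which is the point the suggested wording makes.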
##########
docs/tutorials/tutorial-sketches-theta.md:
##########
@@ -95,103 +97,37 @@ date,uid,show,episode
## Ingest data using Theta sketches
-1. Navigate to the **Load data** wizard in the web console.
-2. Select `Paste data` as the data source and paste the given data:
-
-
-
-3. Leave the source type as `inline` and click **Apply** and **Next: Parse
data**.
-4. Parse the data as CSV, with included headers:
-
-
-
-5. Accept the default values in the **Parse time**, **Transform**, and
**Filter** stages.
-6. In the **Configure schema** stage, enable rollup and confirm your choice in
the dialog. Then set the query granularity to `day`.
-
-
-
-7. Add the Theta sketch during this stage. Select **Add metric**.
-8. Define the new metric as a Theta sketch with the following details:
- * **Name**: `theta_uid`
- * **Type**: `thetaSketch`
- * **Field name**: `uid`
- * **Size**: Accept the default value, `16384`.
- * **Is input theta sketch**: Accept the default value, `False`.
-
-
-
-9. Click **Apply** to add the new metric to the data model.
-
-
-10. You are not interested in individual user ID's, only the unique counts.
Right now, `uid` is still in the data model. To remove it, click on the `uid`
column in the data model and delete it using the trashcan icon on the right:
-
-
-
-11. For the remaining stages of the **Load data** wizard, set the following
options:
- * **Partition**: Set **Segment granularity** to `day`.
- * **Tune**: Leave the default options.
- * **Publish**: Set the datasource name to `ts_tutorial`.
-
-On the **Edit spec** page, your final input spec should match the following:
-
-```json
-{
- "type": "index_parallel",
- "spec": {
- "ioConfig": {
- "type": "index_parallel",
- "inputSource": {
- "type": "inline",
- "data": "date,uid,show,episode\n2022-05-19,alice,Game of
Thrones,S1E1\n2022-05-19,alice,Game of Thrones,S1E2\n2022-05-19,alice,Game of
Thrones,S1E1\n2022-05-19,bob,Bridgerton,S1E1\n2022-05-20,alice,Game of
Thrones,S1E1\n2022-05-20,carol,Bridgerton,S1E2\n2022-05-20,dan,Bridgerton,S1E1\n2022-05-21,alice,Game
of Thrones,S1E1\n2022-05-21,carol,Bridgerton,S1E1\n2022-05-21,erin,Game of
Thrones,S1E1\n2022-05-21,alice,Bridgerton,S1E1\n2022-05-22,bob,Game of
Thrones,S1E1\n2022-05-22,bob,Bridgerton,S1E1\n2022-05-22,carol,Bridgerton,S1E2\n2022-05-22,bob,Bridgerton,S1E1\n2022-05-22,erin,Game
of Thrones,S1E1\n2022-05-22,erin,Bridgerton,S1E2\n2022-05-23,erin,Game of
Thrones,S1E1\n2022-05-23,alice,Game of Thrones,S1E1"
- },
- "inputFormat": {
- "type": "csv",
- "findColumnsFromHeader": true
- }
- },
- "tuningConfig": {
- "type": "index_parallel",
- "partitionsSpec": {
- "type": "hashed"
- },
- "forceGuaranteedRollup": true
- },
- "dataSchema": {
- "dataSource": "ts_tutorial",
- "timestampSpec": {
- "column": "date",
- "format": "auto"
- },
- "dimensionsSpec": {
- "dimensions": [
- "show",
- "episode"
- ]
- },
- "granularitySpec": {
- "queryGranularity": "day",
- "rollup": true,
- "segmentGranularity": "day"
- },
- "metricsSpec": [
- {
- "name": "count",
- "type": "count"
- },
- {
- "type": "thetaSketch",
- "name": "theta_uid",
- "fieldName": "uid"
- }
- ]
- }
- }
-}
+Load the sample dataset using the [`INSERT
INTO`](../multi-stage-query/reference.md/#insert) statement and the
[`EXTERN`](../multi-stage-query/reference.md/#extern-function) function to
ingest the sample data inline. In the [Druid web
console](../operations/web-console.md), go to the **Query** view and run the
following query:
+
+
+```sql
+INSERT INTO "ts_tutorial"
+WITH "source" AS (SELECT * FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"date,uid,show,episode\n2022-05-19,alice,Game of
Thrones,S1E1\n2022-05-19,alice,Game of Thrones,S1E2\n2022-05-19,alice,Game of
Thrones,S1E1\n2022-05-19,bob,Bridgerton,S1E1\n2022-05-20,alice,Game of
Thrones,S1E1\n2022-05-20,carol,Bridgerton,S1E2\n2022-05-20,dan,Bridgerton,S1E1\n2022-05-21,alice,Game
of Thrones,S1E1\n2022-05-21,carol,Bridgerton,S1E1\n2022-05-21,erin,Game of
Thrones,S1E1\n2022-05-21,alice,Bridgerton,S1E1\n2022-05-22,bob,Game of
Thrones,S1E1\n2022-05-22,bob,Bridgerton,S1E1\n2022-05-22,carol,Bridgerton,S1E2\n2022-05-22,bob,Bridgerton,S1E1\n2022-05-22,erin,Game
of Thrones,S1E1\n2022-05-22,erin,Bridgerton,S1E2\n2022-05-23,erin,Game of
Thrones,S1E1\n2022-05-23,alice,Game of Thrones,S1E1"}',
+ '{"type":"csv","findColumnsFromHeader":true}'
+ )
+) EXTEND ("date" VARCHAR, "show" VARCHAR, "episode" VARCHAR, "uid" VARCHAR))
+SELECT
+ TIME_FLOOR(TIME_PARSE("date"), 'P1D') AS "__time",
+ "show",
+ "episode",
+ COUNT(*) AS "count",
+ DS_THETA("uid") AS "theta_uid"
+FROM "source"
+GROUP BY 1, 2, 3
+PARTITIONED BY DAY
```
-Notice the `theta_uid` object in the `metricsSpec` list, that defines the
`thetaSketch` aggregator on the `uid` column during ingestion.
+Notice that there is no `uid` in the `SELECT` statement.
+
+In this scenario you are not interested in individual user IDs, only the
unique counts.
+
Review Comment:
```suggestion
In this scenario you are not interested in individual user IDs, only the
unique counts.
```