vtlim commented on code in PR #16953:
URL: https://github.com/apache/druid/pull/16953#discussion_r1755205671
##########
docs/tutorials/tutorial-sketches-theta.md:
##########
@@ -60,9 +60,11 @@ In this tutorial, you will learn how to do the following:
## Prerequisites
-For this tutorial, you should have already downloaded Druid as described in the [single-machine quickstart](index.md) and have it running on your local machine.
-It will also be helpful to have finished [Tutorial: Loading a file](../tutorials/tutorial-batch.md) and [Tutorial: Querying data](../tutorials/tutorial-query.md).
+Before proceeding, download Druid as described in the [single-machine quickstart](index.md) and have it running on your local machine. You don't need to load any data into the Druid cluster.
+
+It's helpful to have finished [Tutorial: Loading a file](../tutorials/tutorial-batch.md) and [Tutorial: Querying data](../tutorials/tutorial-query.md).
+
+## Sample Data
Review Comment:
Sentence case
##########
docs/tutorials/tutorial-sketches-theta.md:
##########
@@ -95,103 +97,29 @@ date,uid,show,episode
## Ingest data using Theta sketches
-1. Navigate to the **Load data** wizard in the web console.
-2. Select `Paste data` as the data source and paste the given data:
-
-
-
-3. Leave the source type as `inline` and click **Apply** and **Next: Parse data**.
-4. Parse the data as CSV, with included headers:
-
-
-
-5. Accept the default values in the **Parse time**, **Transform**, and **Filter** stages.
-6. In the **Configure schema** stage, enable rollup and confirm your choice in the dialog. Then set the query granularity to `day`.
-
-
-
-7. Add the Theta sketch during this stage. Select **Add metric**.
-8. Define the new metric as a Theta sketch with the following details:
- * **Name**: `theta_uid`
- * **Type**: `thetaSketch`
- * **Field name**: `uid`
- * **Size**: Accept the default value, `16384`.
- * **Is input theta sketch**: Accept the default value, `False`.
-
-
-
-9. Click **Apply** to add the new metric to the data model.
-
-
-10. You are not interested in individual user ID's, only the unique counts. Right now, `uid` is still in the data model. To remove it, click on the `uid` column in the data model and delete it using the trashcan icon on the right:
-
-
-
-11. For the remaining stages of the **Load data** wizard, set the following options:
- * **Partition**: Set **Segment granularity** to `day`.
- * **Tune**: Leave the default options.
- * **Publish**: Set the datasource name to `ts_tutorial`.
-
-On the **Edit spec** page, your final input spec should match the following:
-
-```json
-{
- "type": "index_parallel",
- "spec": {
- "ioConfig": {
- "type": "index_parallel",
- "inputSource": {
- "type": "inline",
-      "data": "date,uid,show,episode\n2022-05-19,alice,Game of Thrones,S1E1\n2022-05-19,alice,Game of Thrones,S1E2\n2022-05-19,alice,Game of Thrones,S1E1\n2022-05-19,bob,Bridgerton,S1E1\n2022-05-20,alice,Game of Thrones,S1E1\n2022-05-20,carol,Bridgerton,S1E2\n2022-05-20,dan,Bridgerton,S1E1\n2022-05-21,alice,Game of Thrones,S1E1\n2022-05-21,carol,Bridgerton,S1E1\n2022-05-21,erin,Game of Thrones,S1E1\n2022-05-21,alice,Bridgerton,S1E1\n2022-05-22,bob,Game of Thrones,S1E1\n2022-05-22,bob,Bridgerton,S1E1\n2022-05-22,carol,Bridgerton,S1E2\n2022-05-22,bob,Bridgerton,S1E1\n2022-05-22,erin,Game of Thrones,S1E1\n2022-05-22,erin,Bridgerton,S1E2\n2022-05-23,erin,Game of Thrones,S1E1\n2022-05-23,alice,Game of Thrones,S1E1"
- },
- "inputFormat": {
- "type": "csv",
- "findColumnsFromHeader": true
- }
- },
- "tuningConfig": {
- "type": "index_parallel",
- "partitionsSpec": {
- "type": "hashed"
- },
- "forceGuaranteedRollup": true
- },
- "dataSchema": {
- "dataSource": "ts_tutorial",
- "timestampSpec": {
- "column": "date",
- "format": "auto"
- },
- "dimensionsSpec": {
- "dimensions": [
- "show",
- "episode"
- ]
- },
- "granularitySpec": {
- "queryGranularity": "day",
- "rollup": true,
- "segmentGranularity": "day"
- },
- "metricsSpec": [
- {
- "name": "count",
- "type": "count"
- },
- {
- "type": "thetaSketch",
- "name": "theta_uid",
- "fieldName": "uid"
- }
- ]
- }
- }
-}
-```
+Load the sample dataset using the [`INSERT INTO`](../multi-stage-query/reference.md/#insert) statement and the [`EXTERN`](../multi-stage-query/reference.md/#extern-function) function to ingest the sample data inline. In the [Druid web console](../operations/web-console.md), go to the **Query** view and run the following query:
-Notice the `theta_uid` object in the `metricsSpec` list, that defines the `thetaSketch` aggregator on the `uid` column during ingestion.
-Click **Submit** to start the ingestion.
+```sql
+INSERT INTO "ts_tutorial"
+WITH "source" AS (SELECT * FROM TABLE(
+ EXTERN(
+    '{"type":"inline","data":"date,uid,show,episode\n2022-05-19,alice,Game of Thrones,S1E1\n2022-05-19,alice,Game of Thrones,S1E2\n2022-05-19,alice,Game of Thrones,S1E1\n2022-05-19,bob,Bridgerton,S1E1\n2022-05-20,alice,Game of Thrones,S1E1\n2022-05-20,carol,Bridgerton,S1E2\n2022-05-20,dan,Bridgerton,S1E1\n2022-05-21,alice,Game of Thrones,S1E1\n2022-05-21,carol,Bridgerton,S1E1\n2022-05-21,erin,Game of Thrones,S1E1\n2022-05-21,alice,Bridgerton,S1E1\n2022-05-22,bob,Game of Thrones,S1E1\n2022-05-22,bob,Bridgerton,S1E1\n2022-05-22,carol,Bridgerton,S1E2\n2022-05-22,bob,Bridgerton,S1E1\n2022-05-22,erin,Game of Thrones,S1E1\n2022-05-22,erin,Bridgerton,S1E2\n2022-05-23,erin,Game of Thrones,S1E1\n2022-05-23,alice,Game of Thrones,S1E1"}',
+ '{"type":"csv","findColumnsFromHeader":true}'
+ )
+) EXTEND ("date" VARCHAR, "show" VARCHAR, "episode" VARCHAR, "uid" VARCHAR))
+SELECT
+ TIME_FLOOR(TIME_PARSE("date"), 'P1D') AS "__time",
+ "show",
+ "episode",
+ COUNT(*) AS "count",
+ DS_THETA("uid") AS "theta_uid"
+FROM "source"
+GROUP BY 1, 2, 3
+PARTITIONED BY DAY
+```
+
+Notice how there is no `uid` in the `SELECT` statement. In this scenario you are not interested in individual user ID's, only the unique counts. Instead you use the `DS_THETA` aggregator function to create a Theta sketch on the values of `uid`. The [`DS_THETA`](../development/extensions-core/datasketches-theta.md#aggregator) function has an optional second parameter, `size`, which accepts a positive integer-power of 2 greater than 0. The `size` parameter refers to the maximum number of entries the Theta sketch object retains. Higher values of `size` result in higher accuracy, but require more space. The default value of `size` is 16384, and is recommended in most use cases. The `GROUP BY` statement groups the entries for each episode of a show watched on the same day.
Review Comment:
```suggestion
Notice that there is no `uid` in the `SELECT` statement.
In this scenario you are not interested in individual user IDs, only the unique counts.
Instead you create Theta sketches on the values of `uid` using the `DS_THETA` function.
[`DS_THETA`](../development/extensions-core/datasketches-theta.md#aggregator) has an optional second parameter that controls the accuracy and size of the sketches.
The `GROUP BY` statement groups the entries for each episode of a show watched on the same day.
```
Comments:
* For a large paragraph it's better to separate them into multiple lines for easier tracking and version control
* We don't need to go into so much detail for `size` in a tutorial
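For context on the point about `size` (not part of the PR): in Druid SQL the sketch size is the optional second argument to `DS_THETA`. A hypothetical variant of the ingestion query above with a non-default size would be:

```sql
SELECT
  TIME_FLOOR(TIME_PARSE("date"), 'P1D') AS "__time",
  "show",
  "episode",
  COUNT(*) AS "count",
  -- 32768 is a hypothetical non-default size; it must be a power of 2.
  -- Larger sizes give more accurate estimates but larger sketches.
  DS_THETA("uid", 32768) AS "theta_uid"
FROM "source"
GROUP BY 1, 2, 3
```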
##########
docs/tutorials/tutorial-sketches-theta.md:
##########
@@ -256,7 +181,10 @@ SELECT THETA_SKETCH_ESTIMATE(
FROM ts_tutorial
```
-
+The `APPROX_COUNT_DISTINCT_DS_THETA` function applies the following:
+
+* `DS_THETA`: Creates a new Theta sketch from the column of Theta sketches.
+* `THETA_SKETCH_ESTIMATE`: Calculates the distinct count estimate from the output of `DS_THETA` where the show is _Bridgerton_.
Review Comment:
> where the show is _Bridgerton_.

Shouldn't be in this description. The FILTER part doesn't belong to THETA_SKETCH_ESTIMATE.
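For reference, in the query this hunk describes, the `FILTER` clause belongs to the `DS_THETA` aggregation rather than to `THETA_SKETCH_ESTIMATE`. A decomposition consistent with this comment, reconstructed from the surrounding diff, would read:

```sql
SELECT THETA_SKETCH_ESTIMATE(
  DS_THETA(theta_uid) FILTER(WHERE "show" = 'Bridgerton')
) AS users
FROM ts_tutorial
```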
##########
docs/tutorials/tutorial-sketches-theta.md:
##########
@@ -274,7 +202,7 @@ SELECT THETA_SKETCH_ESTIMATE(
FROM ts_tutorial
```
-
Review Comment:
Why did you delete these three images?

##########
docs/tutorials/tutorial-sketches-theta.md:
##########
@@ -209,36 +137,23 @@ Let's first see what the data looks like in Druid. Run the following SQL stateme
SELECT * FROM ts_tutorial
```
-
+
The Theta sketch column `theta_uid` appears as a Base64-encoded string; behind it is a bitmap.
-The following query to compute the distinct counts of user IDs uses `APPROX_COUNT_DISTINCT_DS_THETA` and groups by the other dimensions:
-```sql
-SELECT __time,
- "show",
- "episode",
- APPROX_COUNT_DISTINCT_DS_THETA(theta_uid) AS users
-FROM ts_tutorial
-GROUP BY 1, 2, 3
-```
-
-
-
-In the preceding query, `APPROX_COUNT_DISTINCT_DS_THETA` is equivalent to calling `DS_THETA` and `THETA_SKETCH_ESIMATE` as follows:
+The following query uses `THETA_SKETCH_ESTIMATE` to compute the distinct counts of user IDs and groups by the other dimensions:
```sql
-SELECT __time,
- "show",
- "episode",
- THETA_SKETCH_ESTIMATE(DS_THETA(theta_uid)) AS users
-FROM ts_tutorial
-GROUP BY 1, 2, 3
+SELECT
+ __time,
+ "show",
+ "episode",
+ THETA_SKETCH_ESTIMATE(theta_uid) AS users
+FROM ts_tutorial
+GROUP BY 1, 2, 3, 4
Review Comment:
Double check SQL standard on this. I don't think you need the GROUP BY since you're just selecting and applying a scalar function. When I try the query without GROUP BY, I get the same thing.
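The simplified query this comment describes, with `THETA_SKETCH_ESTIMATE` applied as a plain scalar function and no `GROUP BY`, would read:

```sql
SELECT
  __time,
  "show",
  "episode",
  THETA_SKETCH_ESTIMATE(theta_uid) AS users
FROM ts_tutorial
```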
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]