Re: [PR] update window functions doc (druid)

via GitHub Fri, 16 Feb 2024 05:12:11 -0800


kgyrtkirk commented on code in PR #15902:
URL: https://github.com/apache/druid/pull/15902#discussion_r1492424134



##########
docs/querying/sql-window-functions.md:
##########
@@ -88,40 +88,171 @@ You can use the OVER clause to treat other Druid 
aggregation functions as window
 
 Window functions support aliasing.
 
-## Define a window with the OVER clause
+## Window function syntax
+
+In general, Druid Window functions use the following syntax:
+
+```sql
+SELECT
+    dimensions,
+    aggregation function(s)
+    window_function()
+      OVER ( PARTITION BY partitioning expression
+             ORDER BY order expression
+             frame clause
+            )
+    FROM table
+    GROUP BY dimensions
+```
 
 The OVER clause defines the query windows for window functions as follows:
 - PARTITION BY indicates the dimension that defines the rows within the window
 - ORDER BY specifies the order of the rows within the windows.
 
+for example, the following OVER clause example sets the window dimension to 
`channel` and orders the results by the absolute value of `delta` ascending:
+
+```sql
+...
+RANK() OVER (PARTITION BY channel ORDER BY ABS(delta) ASC)
+...
+```
+
+Druid applies the GROUP BY dimensions first before calculating all non-window 
aggregation functions. Then it applies the window function over the aggregate 
results.
+
 :::note
 
 Sometimes windows are called partitions. However, the partitioning for window 
functions are a shuffle (partition) of the result set created at query time and 
is not to be confused with Druid's segment partitioning feature which 
partitions data at ingest time.
 
 :::
 
-The following OVER clause example sets the window dimension to `channel` and 
orders the results by the absolute value of `delta` ascending:
+### ORDER BY windows
+
+When the window definition only specifies ORDER BY , it sorts the aggregate 
data set and applies the function in that order.
+
+The following query uses ORDER BY SUM(delta) DESC to rank user hourly activity 
from the most changed the least changed within an hour:
 
 ```sql
-...
-RANK() OVER (PARTITION BY channel ORDER BY ABS(delta) ASC)
-...
+SELECT
+    TIME_FLOOR(__time, 'PT1H') as time_hour, 
+    channel, 
+    user,
+    SUM(delta) net_user_changes,
+    RANK( ) OVER ( ORDER BY SUM(delta) DESC ) editing_rank
+FROM "wikipedia"
+WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
+  AND __time BETWEEN '2016-06-27' AND '2016-06-28'
+GROUP BY TIME_FLOOR(__time, 'PT1H'), channel, user
+ORDER BY 5 
+```
+
+### PARTITION BY windows
+
+When a window only specifies PARTITION BY partition expression, Druid 
calculates the aggregate window function over all the rows that share a values 
within the selected dataset.
+
+The following example demonstrates a query that uses two different windows 
PARTITION BY channel and PARTITION BY user to calculate the total activity in 
the channel and total activity by the user so that they can be compared to 
individual hourly activity:
+
+```sql
+SELECT
+    TIME_FLOOR(__time, 'PT1H') as time_hour, channel, user,
+    SUM(delta) hourly_user_changes,
+    SUM(SUM(delta)) OVER (PARTITION BY user ) AS total_user_changes,
+    SUM(SUM(delta)) OVER (PARTITION BY channel ) AS total_channel_changes
+FROM "wikipedia"
+WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
+  AND __time BETWEEN '2016-06-27' AND '2016-06-28'
+GROUP BY TIME_FLOOR(__time, 'PT1H'),2,3
+ORDER BY channel,TIME_FLOOR(__time, 'PT1H'), user
+```
+
+The windows only define the PARTITION BY clause of the window, so Druid 
performs the calculation over the whole dataset for each value of the partition 
expression.
+In this example, the dataset is filtered for a single day, therefore the 
window function results represent the total activity for the day, for the 
`user` and for the `channel` respectively.
+
+This type of result helps you analyze the impact of an individual user's 
hourly activity:
+- the impact to the channel by comparing `hourly_user_changes` to 
`total_channel_changes`
+- the impact of each user over the channel by `total_user_changes` to 
`total_channel_changes`
+- the progress of each user's individual activity by comparing 
`hourly_user_changes` to `total_user_changes`
+
+### Window frames
+
+You can use window frames to limit the set of rows used for the windowed 
aggregation. The general syntax is:
+
+```sql
+window function
+OVER (
+        [ PARTITION BY partition expression] ORDER BY order expression

Review Comment:
   `ORDER BY` is not mandatory



##########
docs/querying/sql-window-functions.md:
##########
@@ -88,40 +88,171 @@ You can use the OVER clause to treat other Druid 
aggregation functions as window
 
 Window functions support aliasing.
 
-## Define a window with the OVER clause
+## Window function syntax
+
+In general, Druid Window functions use the following syntax:
+
+```sql
+SELECT
+    dimensions,
+    aggregation function(s)
+    window_function()
+      OVER ( PARTITION BY partitioning expression

Review Comment:
   I think the section about window frames could be moved up here; as that 
gives a better understanding of partition by and other parts of the frame.
   
   As this sections is about `syntax` - possibly remove the `select` and other 
non-related things; and only keep
   ```
   window_function() OVER  window
   ```
   
   I think it would be important to somehow show that the `window` can be 
specified later as well
   ```
   window_function() OVER  w
   from t
   WINDOW w AS (PARTITION BY ... )
   ```
   
   



##########
docs/querying/sql-functions.md:
##########
@@ -1092,6 +1140,14 @@ Returns a slice of the array from the zero-based start 
and end indexes.
 
 Joins all elements of the array together by the given delimiter.
 
+## NTILE
+
+`NTILE(tiles)`
+
+**Function type:** [Window](sql-window-functions.md#window-function-reference)
+
+Divides the rows within a window as evenly as possible into the number of 
tiles, also called buckets, and returns the value of the tile that the row 
falls into.

Review Comment:
   `NTILE(2)` will return `1` and `2` - I think we are good



##########
docs/querying/sql-functions.md:
##########
@@ -1124,6 +1180,14 @@ Parses `expr` into a `COMPLEX<json>` object. This 
operator deserializes JSON val
 
 Converts a string into a BIGINT with the given base or into a DECIMAL data 
type if the base is not specified.
 
+## PERCENT_RANK
+
+`PERCENT_RANK()`
+
+**Function type:** [Window](sql-window-functions.md#window-function-reference)
+
+Returns the rank of the row calculated as a percentage according to the 
formula: `(rank - 1) / (total window rows - 1)`.

Review Comment:
   probably an alternate way to write the formula would be `RANK() OVER 
(window) / COUNT(1) OVER (window)`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] update window functions doc (druid)

Reply via email to