techdocsmith commented on code in PR #15902:
URL: https://github.com/apache/druid/pull/15902#discussion_r1498409367
##########
docs/querying/sql-window-functions.md:
##########
@@ -88,40 +88,135 @@ You can use the OVER clause to treat other Druid
aggregation functions as window
Window functions support aliasing.
-## Define a window with the OVER clause
+## Window function syntax
+
+In general, Druid Window functions use the following syntax:
+
+
+```sql
+window_function() OVER (
+ [PARTITION BY partitioning expression]
+ [ORDER BY order expression]
+ [[ROWS, RANGE] BETWEEN range start AND range end])
+FROM table
+GROUP BY dimensions
+```
+
+```sql
+window_function() OVER w
+FROM table
+WINDOW w AS ([PARTITION BY ] [ORDER BY]
+ [[ROWS, RANGE] BETWEEN range start AND range end] )
+GROUP BY dimensions
+```
The OVER clause defines the query windows for window functions as follows:
- PARTITION BY indicates the dimension that defines the rows within the window
-- ORDER BY specifies the order of the rows within the windows.
+- ORDER BY specifies the order of the rows within the windows
+
+An empty OVER clause or the absence of a PARTITION BY clause indicates that
all data belongs to a single window.
+
+For example, the following OVER clause example sets the window dimension to
`channel` and orders the results by the absolute value of `delta` ascending:
+
+```sql
+...
+RANK() OVER (PARTITION BY channel ORDER BY ABS(delta) ASC)
+...
+```
+
+Window frames, set in ROWS and RANGE expressions, limit the set of rows used
for the windowed aggregation. The general syntax is:
+
+ROWS AND RANGE accept the following values start and end :
+- UNBOUND PRECEDING - from the beginning of the partition as order by the
order expression
+- N ROWS PRECEDING - N rows before the current row as ordered by the order
expression
+- CURRENT ROW - the current row
+- N ROWS FOLLOWING - N rows after the current row as ordered by the order
expression
+- UNBOUNDED FOLLOWING - to the end of the partition as ordered by the order
expression
+
+See (Example with window frames)(#example-with-window-frames) for more detail.
+
+Druid applies the GROUP BY dimensions first before calculating all non-window
aggregation functions. Then it applies the window function over the aggregate
results.
:::note
Sometimes windows are called partitions. However, the partitioning for window
functions are a shuffle (partition) of the result set created at query time and
is not to be confused with Druid's segment partitioning feature which
partitions data at ingest time.
:::
-The following OVER clause example sets the window dimension to `channel` and
orders the results by the absolute value of `delta` ascending:
+### ORDER BY windows
+
+When the window definition only specifies ORDER BY , it sorts the aggregate
data set and applies the function in that order.
+
+The following query uses ORDER BY SUM(delta) DESC to rank user hourly activity
from the most changed the least changed within an hour:
```sql
-...
-RANK() OVER (PARTITION BY channel ORDER BY ABS(delta) ASC)
-...
+SELECT
+ TIME_FLOOR(__time, 'PT1H') as time_hour,
+ channel,
+ user,
+ SUM(delta) net_user_changes,
+ RANK( ) OVER ( ORDER BY SUM(delta) DESC ) editing_rank
+FROM "wikipedia"
+WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
+ AND __time BETWEEN '2016-06-27' AND '2016-06-28'
+GROUP BY TIME_FLOOR(__time, 'PT1H'), channel, user
+ORDER BY 5
```
+### PARTITION BY windows
+
+When a window only specifies PARTITION BY partition expression, Druid
calculates the aggregate window function over all the rows that share a values
within the selected dataset.
+
+The following example demonstrates a query that uses two different windows
PARTITION BY channel and PARTITION BY user to calculate the total activity in
the channel and total activity by the user so that they can be compared to
individual hourly activity:
+
+```sql
+SELECT
+ TIME_FLOOR(__time, 'PT1H') as time_hour, channel, user,
+ SUM(delta) hourly_user_changes,
+ SUM(SUM(delta)) OVER (PARTITION BY user ) AS total_user_changes,
+ SUM(SUM(delta)) OVER (PARTITION BY channel ) AS total_channel_changes
+FROM "wikipedia"
+WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
+ AND __time BETWEEN '2016-06-27' AND '2016-06-28'
+GROUP BY TIME_FLOOR(__time, 'PT1H'),2,3
+ORDER BY channel,TIME_FLOOR(__time, 'PT1H'), user
+```
+
+In this example, the dataset is filtered for a single day, therefore the
window function results represent the total activity for the day, for the
`user` and for the `channel` respectively.
+
+This type of result helps you analyze the impact of an individual user's
hourly activity:
+- the impact to the channel by comparing `hourly_user_changes` to
`total_channel_changes`
+- the impact of each user over the channel by `total_user_changes` to
`total_channel_changes`
+- the progress of each user's individual activity by comparing
`hourly_user_changes` to `total_user_changes`
+
+
+
+
+#### Window frame guardrails
+
+Druid has guardrail logic to prevent you from executing window function
queries with window frame expressions that might return unexpected results.
+
+For example:
+- You cannot set expressions as bounds for window frames.
+- You cannot use two FOLLOWING expressions in the window frame. For example:
"rows between 2 FOLLOWING and 3 FOLLOWING"
+- You can only use a RANGE frames when both endpoints are unbounded or current
row.
+
+If you write a query that violates one of these conditions, Druid throws an
error: "The query contains a window frame which may return incorrect results.
To disregard this warning, set [`windowingStrictValidation`] to false in the
query context."
+
## Window function reference
|Function|Notes|
|--------|-----|
-| `ROW_NUMBER()`| Returns the number of the row within the window |
-|`RANK()`| Returns the rank for a row within a window |
-|`DENSE_RANK()`| Returns the rank for a row within a window without gaps. For
example, if two rows tie for rank of 1, the subsequent row is ranked 2. |
-|`PERCENT_RANK()`| Returns the rank of the row calculated as a percentage
according to the formula: `(rank - 1) / (total window rows - 1)` |
-|`CUME_DIST()`| Returns the cumulative distribution of the current row within
the window calculated as `number of window rows at the same rank or higher than
current row` / `total window rows` |
-|`NTILE(tiles)`| Divides the rows within a window as evenly as possible into
the number of tiles, also called buckets, and returns the value of the tile
that the row falls into | None |
-|`LAG(expr[, offset])`| Returns the value evaluated at the row that precedes
the current row by the offset number within the window. `offset` defaults to 1
if not provided |
-|`LEAD(expr[, offset])`| Returns the value evaluated at the row that follows
the current row by the offset number within the window; if there is no such
row, returns the given default value. `offset` defaults to 1 if not provided |
-|`FIRST_VALUE(expr)`| Returns the value for the expression for the first row
within the window |
-|`LAST_VALUE(expr)`| Returns the value for the expression for the last row
within the window |
+| `ROW_NUMBER()` | Returns the number of the row within the window starting
from 1 |
+| `RANK()` | Returns the rank with gaps for a row within a window. For
example, if two rows tie for rank 1, the next rank is 3 |
+| `DENSE_RANK()` | Returns the rank for a row within a window without gaps.
For example, if two rows tie for rank of 1, the subsequent row is ranked 2. |
+| `PERCENT_RANK()` | Returns the relative rank of the row calculated as a
percentage according to the formula: `RANK() OVER (window) / COUNT(1) OVER
(window)` |
+| `CUME_DIST()` | Returns the cumulative distribution of the current row
within the window calculated as number of window rows at the same rank or
higher than current row divided by total window rows. The return value ranges
between `1/number of rows` and 1 |
+| `NTILE(tiles)` | Divides the rows within a window as evenly as possible into
the number of tiles, also called buckets, and returns the value of the tile
that the row falls into | None |
+| `LAG(expr[, offset])` | If you do not supply an `offset`, returns the value
evaluated at the row preeeding the current row. Specify an offset number, `n`
to return the value evaluated at `n` rows preceding the current one |
Review Comment:
```suggestion
| `LAG(expr[, offset])` | If you do not supply an `offset`, returns the
value evaluated at the row preding the current row. Specify an offset number,
`n` to return the value evaluated at `n` rows preceding the current one |
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]