GitHub user xwu0226 opened a pull request:
https://github.com/apache/spark/pull/12147
[SPARK-14361][SQL]Window function exclude clause
#### What changes were proposed in this pull request?
The current Spark SQL does not support the `exclusion` clause, which is
part of ANSI SQL2003âs `Window` syntax. For example, IBM Netezza fully
supports it as shown in the [document web
link](https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html).
This PR is to fill the gap.
##### Introduction
Below is the ANSI SQL2003âs `Window` syntax:
```
FUNCTION_NAME(expr) OVER {window_name | (window_specification)}
window_specification ::= [window_name] [partitioning] [ordering] [framing]
partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name]
ordering ::= ORDER [SIBLINGS] BY rule[, rule...]
rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}]
framing ::= {ROWS | RANGE} {start | between} [exclusion]
start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW}
between ::= BETWEEN bound AND bound
bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING}
exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE
NO OTHERS}
```
Exclusion clause can be used to excluded certain rows from the window
framing when calculating window aggregation function (e.g. AVG, SUM, MAX, MIN,
COUNT, etc) related to current row. Types of window functions that are not
supported are listed below:
1. Offset functions, such as lead(), lag()
2. Ranking functions, such as rank(), dense_rank(), percent_rank(),
cume_dist, ntile()
3. Row number function, such as row_number()
##### Definition
Syntax | Description
------------ | -------------
EXCLUDE CURRENT ROW | Specifies excluding the current row.
EXCLUDE GROUP | Specifies excluding the current row and all rows that are
tied with it. Ties occur when there is a match on the order column or columns.
EXCLUDE NO OTHERS | Specifies not excluding any rows. This value is the
default if you specify no exclusion.
EXCLUDE TIES | Specifies excluding all rows that are tied with the current
row (peer rows), but retaining the current row.
##### Use-case Examples:
- Let's say you want to find out for every employee, where is his/her
salary at compared to the average salary of those within the same department
and whose ages are within 5 years younger and older. The query could be:â¨
```SQL
SELECT NAME, DEPT_ID, SALARY, AGE, AVG(SALARY) AS AVG_WITHIN_5_YEAR
OVER(PARTITION BY DEPT_ID
ORDER BY AGE
RANGE BETWEEN 5 PRECEDING AND 5 FOLLOWING
EXCLUDE CURRENT ROW)
FROM EMPLOYEEâ¨
```
- Let's say you want to compare every customer's yearly purchase with other
customers' average yearly purchase who are at different age group from the
current customer. The query could be:â¨
```SQL
SELECT CUST_NAME, AGE, PROD_CATEGORY, YEARLY_PURCHASE, AVG(YEARLY_PURCHASE)
OVER(PARTITION BY PROD_CATEGORY
ORDER BY AGE
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUND FOLLOWING
EXCLUDE GROUP)
FROM CUSTOMER_PURCHASE_SUMâ¨
```
##### Implementation
**catalyst.parser.SqlBase.g4**
- add excludeClause definition and append to each case in windowFrame
- add excludeType definition for CURRENT ROW,GOUP, TIES, NO OTHERS
**catalyst.parser.AstBuilder.scala**
- parse excludeClause to generate `ExcludeClause `object to set into
`SpecifiedWindowFrame`
**catalyst.expressions.windowExpressions.scala**
- add `ExcludeClause `and `ExcludeType`
- add `ExcludeClause `object as a parameter to `SpecifiedWindowFrame`
- validate EXCLUDE GROUP/TIES requires ordered window frame
**catalyst.analysis.Analyzer.scala**
- validate exclude clause does not support `RowNumberLike`, `RankLike `and
`OffsetWindowFunction`
**execution.Window.scala**
- implement the exclusion logic in sliding frame, expanding frame,
shrinking frame and whole partition frame
**Some Test case examples:**
**DDL and DML:**
sql("create table table1 (col1 int, col2 int, col3 int)")
sql("insert into table1 select 6, 12, 10")
sql("insert into table1 select 6, 11, 4")
sql("insert into table1 select 6, 13, 11")
sql("insert into table1 select 6, 9, 10")
sql("insert into table1 select 6, 15, 8")
sql("insert into table1 select 6, 10, 1")
sql("insert into table1 select 6, 15, 8")
sql("insert into table1 select 6, 7, 4")
sql("insert into table1 select 6, 7, 8")
**Queries:**
```
// sliding frame with exclude current row
select col1, col2, col3, sum(col2)
over (partition by col1
order by col3
rows between 2 preceding and 2 following
exclude current row)
from table1 where col1 = 6
```
##### Results
col1 | col2 | col3 | sum(col2)
----- | ------ | ----- | -------
6 | 10 | 1 | 18
6 | 11 | 4 | 32
**6** | **7** | **4** | **51**
6 | 15 | 8 | 40
6 | 15 | 8 | 40
6 | 7 | 8 | 51
6 | 12 | 10 | 44
6 | 9 | 10 | 32
6 | 6 | 13 | 11 | 21
In this case, for example, if we are looking at the (6, 7, 4) row, the
frame around this current row has 5 rows (6, 10, 1), (6,11, 4), **(6, 7, 4)**,
(6, 15, 8), (6, 15, 8), including the current row. With the `exclude current
row` clause, then the calculation of sum(col2) is on the other 4 rows, which is
10 + 11 + 15 + 15 = 51.
```
//expanding frame with exclude current row
select col1, col2, col3, sum(col2)
over (partition by col1
order by col3
rows between unbounded preceding and current row
exclude group)
from table1 where col1 = 6
```
##### Results
col1 | col2 | col3 | sum(col2)
----- | ------ | ----- | -------
6 | 10 | 1 | null
6 | 11 | 4 | 10
**6** | **7** | **4** | **10**
6 | 15 | 8 | 28
6 | 15 | 8 | 28
6 | 7 | 8 | 28
6 | 12 | 10 | 65
6 | 9 | 10 | 65
6 | 13 | 11 | 86
In this case, for example, if we are looking at row (6, 7, 4), the frame
around this current row has 3 rows (6, 10, 1), (6, 11, 4) and (6, 7, 4). Then,
with `exclude group` clause, rows (6, 11, 4) and (6, 7, 4) will be excluded
because the values of order-by column `col3` on these 2 rows are the same, thus
in the same group. Therefore, the sum(col2) at current frame is 10.
For testcases are located in
hive/execution/**WindowFunctionExcludeSuite.scala**
#### How was this patch tested?
1. created unit tests to cover all framing cases, each with different
exclude types.
2. run regtest over catalyst, sql, and hive tests under SQL component.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/xwu0226/spark window_exclude
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12147.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12147
----
commit e783b1cc0450b3a3f4b98b30e095130955cd779b
Author: xin Wu <[email protected]>
Date: 2016-03-08T01:02:20Z
supporting EXCLUDE clause in Window function framing
commit c9e1dc4f5c30ac3051aa2de0a567a39d9dfb6fe3
Author: xin Wu <[email protected]>
Date: 2016-03-11T00:14:51Z
update
commit f32457de2df539a7e14fb584b4dc589202104548
Author: xin Wu <[email protected]>
Date: 2016-03-11T04:50:21Z
update testcase
commit fa03d45e11d9f02613c8df0c2b459e8d0b195262
Author: xin Wu <[email protected]>
Date: 2016-03-18T01:34:05Z
refining testcase for adding exclude clause for window function framing
commit 564642bdfa2be93dde9f5037bc97a394bb3199a8
Author: xin Wu <[email protected]>
Date: 2016-03-21T17:17:53Z
testcase update
commit 3a7b8f59a9d08de4fe4caf0576a683c49fc0e524
Author: xin Wu <[email protected]>
Date: 2016-03-22T17:21:16Z
more change
commit 50f2dffbd6c650ba734f9cfb4655bb21e50d30ab
Author: xin Wu <[email protected]>
Date: 2016-03-28T19:05:10Z
update
commit 82f474cca66ad8d205f69df510171f984372b96b
Author: xin Wu <[email protected]>
Date: 2016-03-28T23:37:10Z
recover my change lost
commit 2b28c87b3f860a57ad9a3db73a9b4747cba3a619
Author: xin Wu <[email protected]>
Date: 2016-03-29T01:54:42Z
more
commit a34dbceb0e65820bfe037ca759bb3fd9a04c9b2f
Author: xin Wu <[email protected]>
Date: 2016-03-30T23:19:21Z
modifed to support antlr4
commit 45adccbe2197b9e326cf319de266101470cd03c9
Author: xin Wu <[email protected]>
Date: 2016-03-31T14:48:34Z
modify testcases
commit 4c14e9b8c09eb44eb62c396fb453552881bb70c0
Author: xin Wu <[email protected]>
Date: 2016-04-03T02:31:49Z
modified upon review
commit 5dca37cf6b9ab9848f798412014b9ec59e46b2b3
Author: xin Wu <[email protected]>
Date: 2016-04-03T03:42:29Z
modify upon review
commit 1ae0d048e580491d8e7b37b4d61c7ea6ac37f03f
Author: xin Wu <[email protected]>
Date: 2016-04-03T05:30:27Z
update upon review
commit 1521ef3ffe6a0085a1eb718c3ab8c7c8c636f633
Author: xin Wu <[email protected]>
Date: 2016-04-03T18:11:50Z
move frame validation for exclude clause to analysis from parser
commit b88ff1d9fa0a5d94ac3166beb49aa7a2fbfac231
Author: xin Wu <[email protected]>
Date: 2016-04-03T18:23:17Z
add comment
commit a4ca051f2c5e8a03e855a39e31881f63d567d551
Author: xin Wu <[email protected]>
Date: 2016-04-03T20:51:19Z
add testcase for validating that exclude group/ties requires orderby clause
in the window specification
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]