GitHub user xwu0226 opened a pull request:

    https://github.com/apache/spark/pull/12147

    [SPARK-14361][SQL]Window function exclude clause 

    #### What changes were proposed in this pull request?
    
    The current Spark SQL does not support the `exclusion` clause, which is 
part of ANSI SQL2003’s `Window` syntax. For example, IBM Netezza fully 
supports it as shown in the [document web 
link](https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html).
 This PR is to fill the gap. 
    
    
    
    ##### Introduction
    
    Below is the ANSI SQL2003’s `Window` syntax:
    ```
    FUNCTION_NAME(expr) OVER {window_name | (window_specification)}
    window_specification ::= [window_name] [partitioning] [ordering] [framing]
    partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name]
    ordering ::= ORDER [SIBLINGS] BY rule[, rule...]
    rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}]
    framing ::= {ROWS | RANGE} {start | between} [exclusion]
    start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW}
    between ::= BETWEEN bound AND bound
    bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING}
    exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE 
NO OTHERS}
    ```
    Exclusion clause can be used to excluded certain rows from the window 
framing when calculating window aggregation function (e.g. AVG, SUM, MAX, MIN, 
COUNT, etc)  related to current row. Types of window functions that are not 
supported are listed below:
    
    1. Offset functions, such as lead(), lag()
    2. Ranking functions, such as rank(), dense_rank(), percent_rank(), 
cume_dist, ntile()
    3. Row number function, such as row_number()
    
    ##### Definition
    Syntax | Description
    ------------ | -------------
    EXCLUDE CURRENT ROW | Specifies excluding the current row.
    EXCLUDE GROUP | Specifies excluding the current row and all rows that are 
tied with it. Ties occur when there is a match on the order column or columns.
    EXCLUDE NO OTHERS | Specifies not excluding any rows. This value is the 
default if you specify no exclusion.
    EXCLUDE TIES | Specifies excluding all rows that are tied with the current 
row (peer rows), but retaining the current row.
    
    ##### Use-case Examples:
    
    - Let's say you want to find out for every employee, where is his/her 
salary at compared to the average salary of those within the same department 
and whose ages are within 5 years younger and older. The query could be:

    ```SQL
    SELECT NAME, DEPT_ID, SALARY, AGE, AVG(SALARY) AS AVG_WITHIN_5_YEAR
        OVER(PARTITION BY DEPT_ID 
             ORDER BY AGE 
             RANGE BETWEEN 5 PRECEDING AND 5 FOLLOWING 
             EXCLUDE CURRENT ROW) 
    FROM EMPLOYEE

    ```
    
    - Let's say you want to compare every customer's yearly purchase with other 
customers' average yearly purchase who are at different age group from the 
current customer. The query could be:

    ```SQL
    SELECT CUST_NAME, AGE, PROD_CATEGORY, YEARLY_PURCHASE, AVG(YEARLY_PURCHASE) 
        OVER(PARTITION BY PROD_CATEGORY 
             ORDER BY AGE 
             RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUND FOLLOWING 
             EXCLUDE GROUP) 
    FROM CUSTOMER_PURCHASE_SUM

    ```
    
    ##### Implementation 
    **catalyst.parser.SqlBase.g4**
    
    - add excludeClause definition and append to each case in windowFrame
    
    - add excludeType definition for CURRENT ROW,GOUP, TIES, NO OTHERS
    
    **catalyst.parser.AstBuilder.scala**
    
    - parse excludeClause to generate `ExcludeClause `object to set into 
`SpecifiedWindowFrame`
    
    **catalyst.expressions.windowExpressions.scala**
    
    - add `ExcludeClause `and `ExcludeType`
    
    - add `ExcludeClause `object as a parameter to `SpecifiedWindowFrame`
    
    - validate EXCLUDE GROUP/TIES requires ordered window frame
    
    **catalyst.analysis.Analyzer.scala**
    
    - validate exclude clause does not support `RowNumberLike`, `RankLike `and 
`OffsetWindowFunction`
    
    **execution.Window.scala**
    
    - implement the exclusion logic in sliding frame, expanding frame, 
shrinking frame and whole partition frame
    
    **Some Test case examples:**
    **DDL and DML:**
    sql("create table table1 (col1 int, col2 int, col3 int)")
          sql("insert into table1 select 6, 12, 10")
          sql("insert into table1 select 6, 11, 4")
          sql("insert into table1 select 6, 13, 11")
          sql("insert into table1 select 6, 9, 10")
          sql("insert into table1 select 6, 15, 8")
          sql("insert into table1 select 6, 10, 1")
          sql("insert into table1 select 6, 15, 8")
          sql("insert into table1 select 6, 7, 4")
          sql("insert into table1 select 6, 7, 8")
    
    **Queries:**
    ```
    // sliding frame with exclude current row
    select col1, col2, col3, sum(col2) 
        over (partition by col1 
           order by col3 
           rows between 2 preceding and 2 following 
           exclude current row) 
    from table1 where col1 = 6
    ```
    ##### Results
    col1 | col2 | col3 | sum(col2)
    ----- | ------ | ----- | -------
    6 | 10 | 1 | 18
    6 | 11 | 4 | 32
    **6** | **7** | **4** | **51**
    6 | 15 | 8 | 40
    6 | 15 | 8 | 40
    6 | 7 | 8 | 51
    6 | 12 | 10 | 44
    6 | 9 | 10 | 32
    6 | 6 | 13 | 11 | 21
    
    In this case, for example, if we are looking at the (6, 7, 4) row, the 
frame around this current row has 5 rows (6, 10, 1), (6,11, 4), **(6, 7, 4)**, 
(6, 15, 8), (6, 15, 8), including the current row. With the `exclude current 
row` clause, then the calculation of sum(col2) is on the other 4 rows, which is 
10 + 11 + 15 + 15 = 51. 
    
    ```
    //expanding frame with exclude current row
    select col1, col2, col3, sum(col2) 
        over (partition by col1 
                order by col3 
                rows between unbounded preceding and current row 
                exclude group) 
    from table1 where col1 = 6
    ```
    ##### Results
    col1 | col2 | col3 | sum(col2)
    ----- | ------ | ----- | -------
    6 | 10 | 1 | null
    6 | 11 | 4 | 10
    **6** | **7** | **4** | **10**
    6 | 15 | 8 | 28
    6 | 15 | 8 | 28
    6 | 7 | 8 | 28
    6 | 12 | 10 | 65
    6 | 9 | 10 | 65
    6 | 13 | 11 | 86
    
    In this case, for example, if we are looking at row (6, 7, 4), the frame 
around this current row has 3 rows (6, 10, 1), (6, 11, 4) and (6, 7, 4). Then, 
with `exclude group` clause, rows (6, 11, 4) and (6, 7, 4) will be excluded 
because the values of order-by column `col3` on these 2 rows are the same, thus 
in the same group. Therefore, the sum(col2) at current frame is 10. 
    
    For testcases are located in 
hive/execution/**WindowFunctionExcludeSuite.scala**
    
    #### How was this patch tested?
    1. created unit tests to cover all framing cases, each with different 
exclude types. 
    2. run regtest over catalyst, sql, and hive tests under SQL component.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xwu0226/spark window_exclude

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12147.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12147
    
----
commit e783b1cc0450b3a3f4b98b30e095130955cd779b
Author: xin Wu <[email protected]>
Date:   2016-03-08T01:02:20Z

    supporting EXCLUDE clause in Window function framing

commit c9e1dc4f5c30ac3051aa2de0a567a39d9dfb6fe3
Author: xin Wu <[email protected]>
Date:   2016-03-11T00:14:51Z

    update

commit f32457de2df539a7e14fb584b4dc589202104548
Author: xin Wu <[email protected]>
Date:   2016-03-11T04:50:21Z

    update testcase

commit fa03d45e11d9f02613c8df0c2b459e8d0b195262
Author: xin Wu <[email protected]>
Date:   2016-03-18T01:34:05Z

    refining testcase for adding exclude clause for window function framing

commit 564642bdfa2be93dde9f5037bc97a394bb3199a8
Author: xin Wu <[email protected]>
Date:   2016-03-21T17:17:53Z

    testcase update

commit 3a7b8f59a9d08de4fe4caf0576a683c49fc0e524
Author: xin Wu <[email protected]>
Date:   2016-03-22T17:21:16Z

    more change

commit 50f2dffbd6c650ba734f9cfb4655bb21e50d30ab
Author: xin Wu <[email protected]>
Date:   2016-03-28T19:05:10Z

    update

commit 82f474cca66ad8d205f69df510171f984372b96b
Author: xin Wu <[email protected]>
Date:   2016-03-28T23:37:10Z

    recover my change lost

commit 2b28c87b3f860a57ad9a3db73a9b4747cba3a619
Author: xin Wu <[email protected]>
Date:   2016-03-29T01:54:42Z

    more

commit a34dbceb0e65820bfe037ca759bb3fd9a04c9b2f
Author: xin Wu <[email protected]>
Date:   2016-03-30T23:19:21Z

    modifed to support antlr4

commit 45adccbe2197b9e326cf319de266101470cd03c9
Author: xin Wu <[email protected]>
Date:   2016-03-31T14:48:34Z

    modify testcases

commit 4c14e9b8c09eb44eb62c396fb453552881bb70c0
Author: xin Wu <[email protected]>
Date:   2016-04-03T02:31:49Z

    modified upon review

commit 5dca37cf6b9ab9848f798412014b9ec59e46b2b3
Author: xin Wu <[email protected]>
Date:   2016-04-03T03:42:29Z

    modify upon review

commit 1ae0d048e580491d8e7b37b4d61c7ea6ac37f03f
Author: xin Wu <[email protected]>
Date:   2016-04-03T05:30:27Z

    update upon review

commit 1521ef3ffe6a0085a1eb718c3ab8c7c8c636f633
Author: xin Wu <[email protected]>
Date:   2016-04-03T18:11:50Z

    move frame validation for exclude clause to analysis from parser

commit b88ff1d9fa0a5d94ac3166beb49aa7a2fbfac231
Author: xin Wu <[email protected]>
Date:   2016-04-03T18:23:17Z

    add comment

commit a4ca051f2c5e8a03e855a39e31881f63d567d551
Author: xin Wu <[email protected]>
Date:   2016-04-03T20:51:19Z

    add testcase for validating that exclude group/ties requires orderby clause 
in the window specification

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to