metesynnada opened a new pull request, #3570:
URL: https://github.com/apache/arrow-datafusion/pull/3570

   ****Which issue does this PR close?****
   
   We offer a partial implementation of [windows with custom window 
frames](https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS)
 and improve the situation onĀ 
[#361](https://github.com/apache/arrow-datafusion/issues/361).
   
   As a team, this is our first contribution to Datafusion and we hope to 
contribute further to both Datafusion and Ballista in the future. Since we are 
creating a PR for this project for the first time, we would like to get 
feedback on how we are doing in terms of code quality, alignment with the 
project roadmap etc. We also would like to get your ideas on how to close this 
issue completely since we are providing a partial implementation as a first 
step. You can see which cases we cover in integration tests.
   
   ****Rationale for this change****
   
   Datafusion currently does not support custom window frames, but it is on the 
[roadmap](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html).
   
   ****What changes are included in this PR?****
   
   For now, we implemented `ROWS` and `RANGE` modes supporting `PRECEDING` and 
`FOLLOWING`.
   
   As a draft, we currently do not support:
   
   - `GROUPS` mode
   - Timestamp ranges; e.g. `RANGE BETWEEN '1 day' PRECEDING AND '10 days' 
FOLLOWING` since the logical planner does not support the types other than an 
integer.
   - Frame exclusion, i.e `EXCLUDE CURRENT ROW`
   
   **Next steps**
   
   - `GROUPS` mode implementation by extending `calculate_current_window` 
method.
   - Frame exclusion, by logical planner extension and adapting  
`calculate_current_window` method.
   
   **Observations**
   
   - Some aggregation function implementations are not generic, but use `f64`. 
This can create issues with statistical aggregation functions like `CORR(x, y)` 
when greater precision is required. Fortunately, they can be enhanced to 
support other data types similar to `SUM(x)` aggregation.
       
       Also, `evaluation()` of the `CovarianceAccumulator` should be
       
       ```
       @ -374,12 +374,6 @@ impl Accumulator for CovarianceAccumulator {
         };
       
         if count <= 1 {
       -      return Err(DataFusionError::Internal(
       -          "At least two values are needed to calculate 
covariance".to_string(),
       -      ));
       -  }
       -
       -  if self.count == 0 {
             Ok(ScalarValue::Float64(None))
         } else {
             Ok(ScalarValue::Float64(Some(self.algo_const / count as f64)))
       ```
       
       to become compatible with PostgreSQL. However, these issues are separate 
from this PR and we can discuss them under a new issue. For this reason, we 
deferred supporting functions like `CORR(x, y)` to the future.
       
   - Since unstable sorting is used, some queries output different results than 
PostgreSQL. We use only unique columns for `ORDER BY` clauses while testing 
`ROWS` mode.
   
       An example query:
   
       ```sql
       SELECT c2, c3,
          SUM(c2) OVER(ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) as summation2,
          SUM(c3) OVER(ORDER BY c2 ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) as 
summation3,
          SUM(c3) OVER(ORDER BY c1 ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) as 
summation4
       FROM test
       LIMIT 10;
       ```
   
       The Datafusion output is:
   
       ```sql
       +----+-----+------------+------------+------------+
       | c2 | c3  | summation2 | summation3 | summation4 |
       +----+-----+------------+------------+------------+
       | 1  | 12  | 2          | 132        | -13        |
       | 1  | 120 | 3          | 203        | 263        |
       | 1  | 71  | 4          | 118        | 447        |
       | 1  | -85 | 5          | 154        | 37         |
       | 1  | 36  | 5          | 43         | 358        |
       | 1  | -99 | 5          | 48         | -81        |
       | 1  | 125 | 5          | -31        | 247        |
       | 1  | -8  | 5          | 111        | 215        |
       | 1  | 57  | 5          | 3          | 238        |
       | 1  | -72 | 5          | 140        | 247        |
       +----+-----+------------+------------+------------+
       ```
   
       and in PostgreSQL as
   
       ```sql
       + --- + --- + ---------- + ---------- + ---------- +
       | c2  | c3  | summation2 | summation3 | summation4 |
       | --- | --- | ---------- | ---------- | ---------- |
       | 1   | -85 | 2          | -49        | -251       |
       | 1   | 36  | 3          | 71         | 330        |
       | 1   | 120 | 4          | 46         | 284        |
       | 1   | -25 | 5          | 149        | -184       |
       | 1   | 103 | 5          | 305        | -15        |
       | 1   | 71  | 5          | 323        | 251        |
       | 1   | 54  | 5          | 286        | 48         |
       | 1   | 83  | 5          | 255        | 166        |
       | 1   | -56 | 5          | 222        | -79        |
       | 1   | 70  | 5          | 180        | -233       |
       +-----+-----+------------+------------+------------+
       ```
   
   - There is a minor problem in the logical planner, it should run
   
       ```sql
       SELECT 
         SUM(c2) OVER(ORDER BY c5, c6 RANGE BETWEEN UNBOUNDED PRECEDING AND 
UNBOUNDED FOLLOWING) as sum
       FROM test;
       ```
   
       without a problem, however, it produces
   
       ```
       Error: Plan("With window frame of type RANGE, the order by expression 
must be of length 1, got 2")
       ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to