[GitHub] [arrow-datafusion] mustafasrepo opened a new pull request, #3475: Custom window frame

GitBox Wed, 14 Sep 2022 03:35:09 -0700


mustafasrepo opened a new pull request, #3475:
URL: https://github.com/apache/arrow-datafusion/pull/3475


   # Datafusion First PR
   
   ```
   export POSTGRES_DB=test
   export POSTGRES_USER=postgres
   export POSTGRES_HOST=localhost
   export POSTGRES_PORT=5432
   ```
   
   ****Which issue does this PR close?****
   
   We Implement [Window Function 
Calls](https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS)
 of Postgres and improve the situation on 
[#360](https://github.com/apache/arrow-datafusion/issues/360).
   
   We started to contribute to Datafusion just now. Since we are creating a PR 
for the project for the first time, we would like to get your ideas to close 
this issue completely with a partial implementation for now. You can see which 
cases we cover in integration tests.
   
   ****Rationale for this change****
   
   Datafusion did not support window call functions, it is on the 
[roadmap](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html).
   
   ****What changes are included in this PR?****
   
   For now, we implemented `ROWS` and `RANGE` modes supporting `PRECEDING` and 
`FOLLOWING`.
   
   As a draft, we currently do not support 
   
   - `GROUPS` mode
   - Timestamp ranges i.e `RANGE BETWEEN '1 day' PRECEDING AND '10 days' 
FOLLOWING`
   - Frame exclusion, i.e `EXCLUDE CURRENT ROW`
   
   **Next steps**
   
   - `GROUPS` mode implementation by extending `calculate_current_window` 
method.
   - Frame exclusion, by logic planner extension and adapting  
`calculate_current_window` method.
   
   **Observations**
   
   - Since some aggregation functions only use `f64`, there are numerical 
problems with statistical aggregation functions like `CORR(x,y)`, and they can 
be enhanced to support other DataTypes similar to `SUM(x)` aggregation.
       
       Also, `evaluation()` of the `CovarianceAccumulator` should be
       
       ```
       @ -374,12 +374,6 @@ impl Accumulator for CovarianceAccumulator {
         };
       
         if count <= 1 {
       -      return Err(DataFusionError::Internal(
       -          "At least two values are needed to calculate 
covariance".to_string(),
       -      ));
       -  }
       -
       -  if self.count == 0 {
             Ok(ScalarValue::Float64(None))
         } else {
             Ok(ScalarValue::Float64(Some(self.algo_const / count as f64)))
       ```
       
       /
       
       to become the same as PostgreSQL. However, these issues are separate 
from this PR. We did not use `CORR(x,y)` because of these problems.
       
   - Since the sorting is unstable, some queries output different results than 
PostgreSQL. We use only unique columns for `ORDER BY` clauses while testing 
`ROWS` mode.
   
   An example query:
   
   ```sql
   SELECT c2, c3,
      SUM(c2) OVER(ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) as summation2,
      SUM(c3) OVER(ORDER BY c2 ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) as 
summation3,
      SUM(c3) OVER(ORDER BY c1 ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) as 
summation4
   FROM test
   LIMIT 10;
   ```
   
   Outputs in Datafusion as
   
   ```sql
   +----+-----+------------+------------+------------+
   | c2 | c3  | summation2 | summation3 | summation4 |
   +----+-----+------------+------------+------------+
   | 1  | 12  | 2          | 132        | -13        |
   | 1  | 120 | 3          | 203        | 263        |
   | 1  | 71  | 4          | 118        | 447        |
   | 1  | -85 | 5          | 154        | 37         |
   | 1  | 36  | 5          | 43         | 358        |
   | 1  | -99 | 5          | 48         | -81        |
   | 1  | 125 | 5          | -31        | 247        |
   | 1  | -8  | 5          | 111        | 215        |
   | 1  | 57  | 5          | 3          | 238        |
   | 1  | -72 | 5          | 140        | 247        |
   +----+-----+------------+------------+------------+
   ```
   
   and in PostgreSQL as
   
   ```sql
   | c2  | c3  | summation2 | summation3 | summation4 |
   | --- | --- | ---------- | ---------- | ---------- |
   | 1   | -85 | 2          | -49        | -251       |
   | 1   | 36  | 3          | 71         | 330        |
   | 1   | 120 | 4          | 46         | 284        |
   | 1   | -25 | 5          | 149        | -184       |
   | 1   | 103 | 5          | 305        | -15        |
   | 1   | 71  | 5          | 323        | 251        |
   | 1   | 54  | 5          | 286        | 48         |
   | 1   | 83  | 5          | 255        | 166        |
   | 1   | -56 | 5          | 222        | -79        |
   | 1   | 70  | 5          | 180        | -233       |
   +-----+-----+------------+------------+------------+
   ```
   
   - There is a minor problem in the logical planner, it should run
   
   ```sql
   SELECT 
    SUM(c2) OVER(ORDER BY c5, c6 RANGE BETWEEN UNBOUNDED PRECEDING AND 
UNBOUNDED FOLLOWING) as sum
   FROM test;
   ```
   
   without a problem, however, it produces
   
   ```
   Error: Plan("With window frame of type RANGE, the order by expression must 
be of length 1, got 2")
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] mustafasrepo opened a new pull request, #3475: Custom window frame

Reply via email to