metesynnada opened a new pull request, #3570: URL: https://github.com/apache/arrow-datafusion/pull/3570
****Which issue does this PR close?**** We offer a partial implementation of [windows with custom window frames](https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS) and improve the situation onĀ [#361](https://github.com/apache/arrow-datafusion/issues/361). As a team, this is our first contribution to Datafusion and we hope to contribute further to both Datafusion and Ballista in the future. Since we are creating a PR for this project for the first time, we would like to get feedback on how we are doing in terms of code quality, alignment with the project roadmap etc. We also would like to get your ideas on how to close this issue completely since we are providing a partial implementation as a first step. You can see which cases we cover in integration tests. ****Rationale for this change**** Datafusion currently does not support custom window frames, but it is on the [roadmap](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html). ****What changes are included in this PR?**** For now, we implemented `ROWS` and `RANGE` modes supporting `PRECEDING` and `FOLLOWING`. As a draft, we currently do not support: - `GROUPS` mode - Timestamp ranges; e.g. `RANGE BETWEEN '1 day' PRECEDING AND '10 days' FOLLOWING` since the logical planner does not support the types other than an integer. - Frame exclusion, i.e `EXCLUDE CURRENT ROW` **Next steps** - `GROUPS` mode implementation by extending `calculate_current_window` method. - Frame exclusion, by logical planner extension and adapting `calculate_current_window` method. **Observations** - Some aggregation function implementations are not generic, but use `f64`. This can create issues with statistical aggregation functions like `CORR(x, y)` when greater precision is required. Fortunately, they can be enhanced to support other data types similar to `SUM(x)` aggregation. Also, `evaluation()` of the `CovarianceAccumulator` should be ``` @ -374,12 +374,6 @@ impl Accumulator for CovarianceAccumulator { }; if count <= 1 { - return Err(DataFusionError::Internal( - "At least two values are needed to calculate covariance".to_string(), - )); - } - - if self.count == 0 { Ok(ScalarValue::Float64(None)) } else { Ok(ScalarValue::Float64(Some(self.algo_const / count as f64))) ``` to become compatible with PostgreSQL. However, these issues are separate from this PR and we can discuss them under a new issue. For this reason, we deferred supporting functions like `CORR(x, y)` to the future. - Since unstable sorting is used, some queries output different results than PostgreSQL. We use only unique columns for `ORDER BY` clauses while testing `ROWS` mode. An example query: ```sql SELECT c2, c3, SUM(c2) OVER(ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) as summation2, SUM(c3) OVER(ORDER BY c2 ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) as summation3, SUM(c3) OVER(ORDER BY c1 ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) as summation4 FROM test LIMIT 10; ``` The Datafusion output is: ```sql +----+-----+------------+------------+------------+ | c2 | c3 | summation2 | summation3 | summation4 | +----+-----+------------+------------+------------+ | 1 | 12 | 2 | 132 | -13 | | 1 | 120 | 3 | 203 | 263 | | 1 | 71 | 4 | 118 | 447 | | 1 | -85 | 5 | 154 | 37 | | 1 | 36 | 5 | 43 | 358 | | 1 | -99 | 5 | 48 | -81 | | 1 | 125 | 5 | -31 | 247 | | 1 | -8 | 5 | 111 | 215 | | 1 | 57 | 5 | 3 | 238 | | 1 | -72 | 5 | 140 | 247 | +----+-----+------------+------------+------------+ ``` and in PostgreSQL as ```sql + --- + --- + ---------- + ---------- + ---------- + | c2 | c3 | summation2 | summation3 | summation4 | | --- | --- | ---------- | ---------- | ---------- | | 1 | -85 | 2 | -49 | -251 | | 1 | 36 | 3 | 71 | 330 | | 1 | 120 | 4 | 46 | 284 | | 1 | -25 | 5 | 149 | -184 | | 1 | 103 | 5 | 305 | -15 | | 1 | 71 | 5 | 323 | 251 | | 1 | 54 | 5 | 286 | 48 | | 1 | 83 | 5 | 255 | 166 | | 1 | -56 | 5 | 222 | -79 | | 1 | 70 | 5 | 180 | -233 | +-----+-----+------------+------------+------------+ ``` - There is a minor problem in the logical planner, it should run ```sql SELECT SUM(c2) OVER(ORDER BY c5, c6 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as sum FROM test; ``` without a problem, however, it produces ``` Error: Plan("With window frame of type RANGE, the order by expression must be of length 1, got 2") ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
