mustafasrepo opened a new pull request, #3475: URL: https://github.com/apache/arrow-datafusion/pull/3475
# Datafusion First PR ``` export POSTGRES_DB=test export POSTGRES_USER=postgres export POSTGRES_HOST=localhost export POSTGRES_PORT=5432 ``` ****Which issue does this PR close?**** We Implement [Window Function Calls](https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS) of Postgres and improve the situation onĀ [#360](https://github.com/apache/arrow-datafusion/issues/360). We started to contribute to Datafusion just now. Since we are creating a PR for the project for the first time, we would like to get your ideas to close this issue completely with a partial implementation for now. You can see which cases we cover in integration tests. ****Rationale for this change**** Datafusion did not support window call functions, it is on the [roadmap](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html). ****What changes are included in this PR?**** For now, we implemented `ROWS` and `RANGE` modes supporting `PRECEDING` and `FOLLOWING`. As a draft, we currently do not support - `GROUPS` mode - Timestamp ranges i.e `RANGE BETWEEN '1 day' PRECEDING AND '10 days' FOLLOWING` - Frame exclusion, i.e `EXCLUDE CURRENT ROW` **Next steps** - `GROUPS` mode implementation by extending `calculate_current_window` method. - Frame exclusion, by logic planner extension and adapting `calculate_current_window` method. **Observations** - Since some aggregation functions only use `f64`, there are numerical problems with statistical aggregation functions like `CORR(x,y)`, and they can be enhanced to support other DataTypes similar to `SUM(x)` aggregation. Also, `evaluation()` of the `CovarianceAccumulator` should be ``` @ -374,12 +374,6 @@ impl Accumulator for CovarianceAccumulator { }; if count <= 1 { - return Err(DataFusionError::Internal( - "At least two values are needed to calculate covariance".to_string(), - )); - } - - if self.count == 0 { Ok(ScalarValue::Float64(None)) } else { Ok(ScalarValue::Float64(Some(self.algo_const / count as f64))) ``` / to become the same as PostgreSQL. However, these issues are separate from this PR. We did not use `CORR(x,y)` because of these problems. - Since the sorting is unstable, some queries output different results than PostgreSQL. We use only unique columns for `ORDER BY` clauses while testing `ROWS` mode. An example query: ```sql SELECT c2, c3, SUM(c2) OVER(ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) as summation2, SUM(c3) OVER(ORDER BY c2 ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) as summation3, SUM(c3) OVER(ORDER BY c1 ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) as summation4 FROM test LIMIT 10; ``` Outputs in Datafusion as ```sql +----+-----+------------+------------+------------+ | c2 | c3 | summation2 | summation3 | summation4 | +----+-----+------------+------------+------------+ | 1 | 12 | 2 | 132 | -13 | | 1 | 120 | 3 | 203 | 263 | | 1 | 71 | 4 | 118 | 447 | | 1 | -85 | 5 | 154 | 37 | | 1 | 36 | 5 | 43 | 358 | | 1 | -99 | 5 | 48 | -81 | | 1 | 125 | 5 | -31 | 247 | | 1 | -8 | 5 | 111 | 215 | | 1 | 57 | 5 | 3 | 238 | | 1 | -72 | 5 | 140 | 247 | +----+-----+------------+------------+------------+ ``` and in PostgreSQL as ```sql | c2 | c3 | summation2 | summation3 | summation4 | | --- | --- | ---------- | ---------- | ---------- | | 1 | -85 | 2 | -49 | -251 | | 1 | 36 | 3 | 71 | 330 | | 1 | 120 | 4 | 46 | 284 | | 1 | -25 | 5 | 149 | -184 | | 1 | 103 | 5 | 305 | -15 | | 1 | 71 | 5 | 323 | 251 | | 1 | 54 | 5 | 286 | 48 | | 1 | 83 | 5 | 255 | 166 | | 1 | -56 | 5 | 222 | -79 | | 1 | 70 | 5 | 180 | -233 | +-----+-----+------------+------------+------------+ ``` - There is a minor problem in the logical planner, it should run ```sql SELECT SUM(c2) OVER(ORDER BY c5, c6 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as sum FROM test; ``` without a problem, however, it produces ``` Error: Plan("With window frame of type RANGE, the order by expression must be of length 1, got 2") ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
