vbarua opened a new issue, #15006: URL: https://github.com/apache/datafusion/issues/15006
### Describe the bug ## Background In the SQL standard, RESPECT NULLS and IGNORE NULLS are options to be set for the `lead`, `lag`, `first_value`, `last_value` and `nth_value` _window_ functions. > Some window functions permit a null_treatment clause that specifies how to handle NULL values when calculating results. This clause is optional. It is part of the SQL standard, but the MySQL implementation permits only RESPECT NULLS (which is also the default). [MySQL](https://dev.mysql.com/doc/refman/9.2/en/window-function-descriptions.html) > The SQL standard defines a RESPECT NULLS or IGNORE NULLS option for lead, lag, first_value, last_value, and nth_value. This is not implemented in PostgreSQL: the behavior is always the same as the standard's default, namely RESPECT NULLS. Likewise, the standard's FROM FIRST or FROM LAST option for nth_value is not implemented: only the default FROM FIRST behavior is supported. (You can achieve the result of FROM LAST by reversing the ORDER BY ordering.) [Postgres](https://www.postgresql.org/docs/17/functions-window.html) While MySQL and Postgres don't support the IGNORE NULLS options, there are systems that support it for those functions: * [Trino](https://trino.io/docs/current/functions/window.html) * [SqlServer](https://learn.microsoft.com/en-us/sql/t-sql/functions/lag-transact-sql?view=sql-server-ver16#-ignore-nulls--respect-nulls-) * [Snowflake](https://docs.snowflake.com/en/sql-reference/functions/lag) ## Bug DataFusion permits queries like ```sql SELECT FIRST_VALUE(column1) RESPECT NULLS FROM t; ``` and ```sql SELECT FIRST_VALUE(column1) IGNORE NULLS FROM t; ``` and in fact has [tests](https://github.com/apache/datafusion/blob/dd0fd889ea603f929accb99002e2f99280823f5c/datafusion/sqllogictest/test_files/aggregate.slt#L5866-L5874) for them. This doesn't conform with how most engines work with these options. A more "standard" version of these IGNORE NULLS would look like ```sql SELECT FIRST_VALUE(column1) FILTER (WHERE column1 IS NOT NULL) FROM t; ``` That being said, part of the weirdness here is that DataFusion defines `first_value` both as an [aggregate function](https://github.com/apache/datafusion/blob/main/datafusion/functions-aggregate/src/first_last.rs) and as a [window function](https://github.com/apache/datafusion/blob/dd0fd889ea603f929accb99002e2f99280823f5c/datafusion/functions-window/src/nth_value.rs#L41-L46). Other engines treat `first_value` solely as a window function, and provide something like `any_value` ([Trino](https://trino.io/docs/current/functions/aggregate.html#any_value), [Postgres](https://www.postgresql.org/docs/17/functions-aggregate.html), [Snowflake](https://docs.snowflake.com/en/sql-reference/functions/any_value)). ### To Reproduce See tests for RESPECT NULLS and IGNORE NULLS in the codebase https://github.com/apache/datafusion/blob/dd0fd889ea603f929accb99002e2f99280823f5c/datafusion/sqllogictest/test_files/aggregate.slt#L5866-L5874 ### Expected behavior To better conform with standard SQL behaviour, at the SQL level: 1. `RESPECT NULLS` and `IGNORE NULLS` should not be valid options for _aggregate_ functions. 2. `RESPECT NULLS` and `IGNORE NULLS` should be _only_ valid for the following _window_ functions: `lead`, `lag`, `first_value`, `last_value`, `nth_value` ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
