Jefffrey commented on PR #22102:
URL: https://github.com/apache/datafusion/pull/22102#issuecomment-4451305175
Thanks for the very comprehensive reply, it helped me get my thoughts in
order. So to recap a little (just to aid my own thought process):
## Current main state
We introduced `all` (previously missing support) and added support for other
operators for `any` since `=` was already supported; we also modified `=` to
use a case structure to try match postgres semantics which caused breakage in
parquet pruning.
## This PR
Revert `= any` to be exactly the way it was on `53.1.0`, latest release, to
prevent any breaking changes. Also modifies the other operators + `all` to try
to align to postgres where possible.
## A question I have
> For any `null` values in the haystack, PostgreSQL will always mark the
expression as `null`, whereas we diverge slightly since we desugar to existing
functions which filter out nulls:
I'm testing on `PostgreSQL 18.2`, but this doesn't seem to be the case?
```sql
postgres=# select 5 = any(array[null]::int[]);
?column?
----------
(1 row)
postgres=# select 5 = any(array[null,5]::int[]);
?column?
----------
t
(1 row)
postgres=# select 5 = any(array[null,1]::int[]);
?column?
----------
(1 row)
```
- It seems to actually return `true` when it does find something, rather
than always `null`?
# What do we want to do moving forward
So I might've mentioned it before, but we have two main motivations here.
Most importantly is to preserve the pruning behaviour, which means not using
the case approach we did in the previous PR. The other is to try implement
all/any (and their various operators) trying to align as close to postgres as
possible.
## Strictly no breaking changes path
We maintain what we've had for `= any`, and build up support for `all` and
`any` (for other operators) to try align as close to postgres as possible, but
also ensuring it stays consistent with existing `= any`. This has potential for
confusion as we essentially introducing our own semantics for these operators
now; a DataFusion specific version. It looks like this PR aims to do that.
## Minor breaking changes path
An alternative, which requires some breaking changes, is to modify `= any`
to match postgres semantics (for example, how we treat nulls inside the
haystack) **without** using the case structure, to ensure pruning doesn't break
_but_ that behaviour does change slightly. Then we can implement the rest of
the operators and `all` to strictly follow postgres, to try ensure we don't
invent our own standards (at least for `any`/`all`...). It would be achieved by
creating new UDFs I suppose, which has been mentioned before:
- https://github.com/apache/datafusion/pull/21416#discussion_r3104850324
# Path forward
I think I'll need some more time to properly review this PR, especially to
try understand & ensure the semantics we're introducing here for `all`/`any`
are logically consistent whilst also not breaking `= any`. It sounds like we
should avoid any breaking changes, as mentioned here by @alamb:
- https://github.com/apache/datafusion/pull/21743#issuecomment-4412252130
Though it's worth noting the main breakage here was the introduction of the
case structure, rather than the semantics change (though maybe the semantics
change itself is breakage enough).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]