Jefffrey commented on PR #22102:
URL: https://github.com/apache/datafusion/pull/22102#issuecomment-4451305175

   Thanks for the very comprehensive reply, it helped me get my thoughts in 
order. So to recap a little (just to aid my own thought process):
   
   ## Current main state
   
   We introduced `all` (previously missing support) and added support for other 
operators for `any` since `=` was already supported; we also modified `=` to 
use a case structure to try match postgres semantics which caused breakage in 
parquet pruning.
   
   ## This PR
   
   Revert `= any` to be exactly the way it was on `53.1.0`, latest release, to 
prevent any breaking changes. Also modifies the other operators + `all` to try 
to align to postgres where possible.
   
   ## A question I have
   
   > For any `null` values in the haystack, PostgreSQL will always mark the 
expression as `null`, whereas we diverge slightly since we desugar to existing 
functions which filter out nulls:
   
   I'm testing on `PostgreSQL 18.2`, but this doesn't seem to be the case?
   
   ```sql
   postgres=# select 5 = any(array[null]::int[]);
    ?column?
   ----------
   
   (1 row)
   
   postgres=# select 5 = any(array[null,5]::int[]);
    ?column?
   ----------
    t
   (1 row)
   
   postgres=# select 5 = any(array[null,1]::int[]);
    ?column?
   ----------
   
   (1 row)
   ```
   
   - It seems to actually return `true` when it does find something, rather 
than always `null`?
   
   # What do we want to do moving forward
   
   So I might've mentioned it before, but we have two main motivations here. 
Most importantly is to preserve the pruning behaviour, which means not using 
the case approach we did in the previous PR. The other is to try implement 
all/any (and their various operators) trying to align as close to postgres as 
possible.
   
   ## Strictly no breaking changes path
   
   We maintain what we've had for `= any`, and build up support for `all` and 
`any` (for other operators) to try align as close to postgres as possible, but 
also ensuring it stays consistent with existing `= any`. This has potential for 
confusion as we essentially introducing our own semantics for these operators 
now; a DataFusion specific version. It looks like this PR aims to do that.
   
   ## Minor breaking changes path
   
   An alternative, which requires some breaking changes, is to modify `= any` 
to match postgres semantics (for example, how we treat nulls inside the 
haystack) **without** using the case structure, to ensure pruning doesn't break 
_but_ that behaviour does change slightly. Then we can implement the rest of 
the operators and `all` to strictly follow postgres, to try ensure we don't 
invent our own standards (at least for `any`/`all`...). It would be achieved by 
creating new UDFs I suppose, which has been mentioned before:
   
   - https://github.com/apache/datafusion/pull/21416#discussion_r3104850324
   
   # Path forward
   
   I think I'll need some more time to properly review this PR, especially to 
try understand & ensure the semantics we're introducing here for `all`/`any` 
are logically consistent whilst also not breaking `= any`. It sounds like we 
should avoid any breaking changes, as mentioned here by @alamb:
   
   - https://github.com/apache/datafusion/pull/21743#issuecomment-4412252130
   
   Though it's worth noting the main breakage here was the introduction of the 
case structure, rather than the semantics change (though maybe the semantics 
change itself is breakage enough).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to