cetra3 commented on PR #22102:
URL: https://github.com/apache/datafusion/pull/22102#issuecomment-4448190995

   So I am basing this PR relative to the latest tagged release, which is 
`53.1.0`.   This release *only* supported the `needle = ANY(array)` shape and 
nothing else. 
   
   For this one shape, this PR adjusts the behaviour *back* to `53.1.0` and how 
it desugared to `array_has`.  No other shapes were supported.
   
   But also, in this PR, I have tried to make *simple* rules about what 
expressions get desugared to.  Essentially all non-nullable behaviour matches 
semantics like you'd expect, it's just weird edge cases around some expressions 
with null values that diverge.
   
   I'm open to expanding/adjusting this, as long as it doesn't impact the 
existing use case (`= ANY`), but I think this PR is a balance of pragmatism & 
correctness.
   
   Here's some example desugaring:
   
   `needle OP ANY(haystack)`:
   
   | op   | desugar                                                             
                             |
   
|------|--------------------------------------------------------------------------------------------------|
   | `=`  | `array_has(haystack, needle)`                                       
                             |
   | `<>` | `cardinality(haystack) > 0 AND (array_min(haystack) <> needle OR 
array_max(haystack) <> needle)` |
   | `>`  | `cardinality(haystack) > 0 AND needle > array_min(haystack)`        
                             |
   | `<`  | `cardinality(haystack) > 0 AND needle < array_max(haystack)`        
                             |
   | `>=` | `cardinality(haystack) > 0 AND needle >= array_min(haystack)`       
                             |
   | `<=` | `cardinality(haystack) > 0 AND needle <= array_max(haystack)`       
                             |
   
   `needle OP ALL(haystack)`:
   
   | op   | desugar                                                             
                             |
   
|------|--------------------------------------------------------------------------------------------------|
   | `=`  | `cardinality(haystack) = 0 OR (array_min(haystack) = needle AND 
array_max(haystack) = needle)`   |
   | `<>` | `NOT array_has(haystack, needle)`                                   
                             |
   | `>`  | `cardinality(haystack) = 0 OR needle > array_max(haystack)`         
                             |
   | `<`  | `cardinality(haystack) = 0 OR needle < array_min(haystack)`         
                             |
   | `>=` | `cardinality(haystack) = 0 OR needle >= array_max(haystack)`        
                             |
   | `<=` | `cardinality(haystack) = 0 OR needle <= array_min(haystack)`        
                             |
   
   ### Cardinality Check
   
   The cardinality check is there to deal with empty haystacks and ensuring we 
return a boolean true/false rather than null.  If we desugared to 
`array_min([])` directly we'd get `null` values back.  So the cardinality check 
is there to help make that a bit nicer.
   
   Here's a table:
   
   | Expression                   | `cardinality` | `array_min` / `array_max` | 
Desugar evaluated                                | Result |
   
|------------------------------|---------------|----------------------------|--------------------------------------------------|--------|
   | `5 > ANY([])`                | 0             | NULL / NULL                
| `0 > 0 AND (5 > NULL)` → `FALSE AND NULL`        | **F**  |
   | `5 > ALL([])`                | 0             | NULL / NULL                
| `0 = 0 OR (5 > NULL)` → `TRUE OR NULL`           | **T**  |
   | `5 > ANY([3, 7])`            | 2             | 3 / 7                      
| `2 > 0 AND 5 > 3` → `TRUE AND TRUE`              | **T**  |
   | `5 > ALL([3, 7])`            | 2             | 3 / 7                      
| `2 = 0 OR 5 > 7` → `FALSE OR FALSE`              | **F**  |
   | `5 > ANY([3, NULL])`         | 2             | 3 / 3                      
| `2 > 0 AND 5 > 3` → `TRUE AND TRUE`              | **T**  |
   | `5 > ALL([3, NULL])`         | 2             | 3 / 3                      
| `2 = 0 OR 5 > 3` → `FALSE OR TRUE`               | **T**  |
   | `5 > ANY([6, NULL])`         | 2             | 6 / 6                      
| `2 > 0 AND 5 > 6` → `TRUE AND FALSE`             | **F**  |
   
   ### PostgreSQL Divergence
   
   It's when you start mixing in null values to needles and haystacks that 
things diverge from other systems. Each of these functions treat `null` as 
`absent`, whereas `null` in PostgreSQL semantics is treated at `not defined`:
   
   * For `null` needles, pretty much every shape is the same except two 
exceptions around *empty* haystacks (which are weird edge cases):
   
   | Expression                          | `cardinality` | This PR    | 
PostgreSQL    |
   |-------------------------------------|---------------|-------|-------|
   | `NULL = ANY([])`                    | 0             | N     | F     |
   | `NULL <> ALL([])`                   | 0             | N     | T     |
   
   
   * For any `null` values in the haystack, PostgreSQL will always mark the 
expression as `null`, whereas we diverge slightly since we desugar to existing 
functions which filter out nulls:
   
   | Expression                    | This PR     | PostgreSQL    | Why          
                                              |
   
|-------------------------------|--------|-------|------------------------------------------------------------|
   | `5 = ANY([NULL, NULL])`       | **F**  | N     | `array_has([NULL,NULL], 
5)` = FALSE. PG: `5=N OR 5=N` = N. |
   | `5 = ANY([3, NULL])`          | **F**  | N     | `array_has([3,NULL], 5)` 
= FALSE. PG: `5=3 OR 5=N` = `F OR N` = N. |
   | `5 <> ALL([NULL, NULL])`      | **T**  | N     | `NOT array_has(…)` = NOT 
FALSE = TRUE. PG: `5<>N AND 5<>N` = N. |
   | `5 <> ALL([3, NULL])`         | **T**  | N     | Same shape.               
                                 |
   | `5 > ALL([3, NULL])`          | **T**  | N     | `array_max([3,NULL])` = 
3, so `5 > 3` = TRUE. PG: `5>3 AND 5>N` = `T AND N` = N. |
   | `5 < ANY([3, NULL])`          | **F**  | N     | `array_max([3,NULL])` = 
3, so `5 < 3` = FALSE. PG: `5<3 OR 5<N` = `F OR N` = N. |
   | `5 >= ALL([3, NULL])`         | **T**  | N     | Same pattern as `>`.      
                                            |
   | `5 = ALL([5, NULL])`          | **T**  | N     | min=max=5, so both `5=5`. 
PG: `5=5 AND 5=N` = `T AND N` = N. (Subtle — needle equals the only non-NULL.) |
   | `5 <> ANY([5, NULL])`         | **F**  | N     | min=max=5, so both `5<>5` 
= FALSE. PG: `5<>5 OR 5<>N` = `F OR N` = N. |
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to