cetra3 commented on PR #22102:
URL: https://github.com/apache/datafusion/pull/22102#issuecomment-4448190995
So I am basing this PR relative to the latest tagged release, which is
`53.1.0`. This release *only* supported the `needle = ANY(array)` shape and
nothing else.
For this one shape, this PR adjusts the behaviour *back* to `53.1.0` and how
it desugared to `array_has`. No other shapes were supported.
But also, in this PR, I have tried to make *simple* rules about what
expressions get desugared to. Essentially all non-nullable behaviour matches
semantics like you'd expect, it's just weird edge cases around some expressions
with null values that diverge.
I'm open to expanding/adjusting this, as long as it doesn't impact the
existing use case (`= ANY`), but I think this PR is a balance of pragmatism &
correctness.
Here's some example desugaring:
`needle OP ANY(haystack)`:
| op | desugar
|
|------|--------------------------------------------------------------------------------------------------|
| `=` | `array_has(haystack, needle)`
|
| `<>` | `cardinality(haystack) > 0 AND (array_min(haystack) <> needle OR
array_max(haystack) <> needle)` |
| `>` | `cardinality(haystack) > 0 AND needle > array_min(haystack)`
|
| `<` | `cardinality(haystack) > 0 AND needle < array_max(haystack)`
|
| `>=` | `cardinality(haystack) > 0 AND needle >= array_min(haystack)`
|
| `<=` | `cardinality(haystack) > 0 AND needle <= array_max(haystack)`
|
`needle OP ALL(haystack)`:
| op | desugar
|
|------|--------------------------------------------------------------------------------------------------|
| `=` | `cardinality(haystack) = 0 OR (array_min(haystack) = needle AND
array_max(haystack) = needle)` |
| `<>` | `NOT array_has(haystack, needle)`
|
| `>` | `cardinality(haystack) = 0 OR needle > array_max(haystack)`
|
| `<` | `cardinality(haystack) = 0 OR needle < array_min(haystack)`
|
| `>=` | `cardinality(haystack) = 0 OR needle >= array_max(haystack)`
|
| `<=` | `cardinality(haystack) = 0 OR needle <= array_min(haystack)`
|
### Cardinality Check
The cardinality check is there to deal with empty haystacks and ensuring we
return a boolean true/false rather than null. If we desugared to
`array_min([])` directly we'd get `null` values back. So the cardinality check
is there to help make that a bit nicer.
Here's a table:
| Expression | `cardinality` | `array_min` / `array_max` |
Desugar evaluated | Result |
|------------------------------|---------------|----------------------------|--------------------------------------------------|--------|
| `5 > ANY([])` | 0 | NULL / NULL
| `0 > 0 AND (5 > NULL)` → `FALSE AND NULL` | **F** |
| `5 > ALL([])` | 0 | NULL / NULL
| `0 = 0 OR (5 > NULL)` → `TRUE OR NULL` | **T** |
| `5 > ANY([3, 7])` | 2 | 3 / 7
| `2 > 0 AND 5 > 3` → `TRUE AND TRUE` | **T** |
| `5 > ALL([3, 7])` | 2 | 3 / 7
| `2 = 0 OR 5 > 7` → `FALSE OR FALSE` | **F** |
| `5 > ANY([3, NULL])` | 2 | 3 / 3
| `2 > 0 AND 5 > 3` → `TRUE AND TRUE` | **T** |
| `5 > ALL([3, NULL])` | 2 | 3 / 3
| `2 = 0 OR 5 > 3` → `FALSE OR TRUE` | **T** |
| `5 > ANY([6, NULL])` | 2 | 6 / 6
| `2 > 0 AND 5 > 6` → `TRUE AND FALSE` | **F** |
### PostgreSQL Divergence
It's when you start mixing in null values to needles and haystacks that
things diverge from other systems. Each of these functions treat `null` as
`absent`, whereas `null` in PostgreSQL semantics is treated at `not defined`:
* For `null` needles, pretty much every shape is the same except two
exceptions around *empty* haystacks (which are weird edge cases):
| Expression | `cardinality` | This PR |
PostgreSQL |
|-------------------------------------|---------------|-------|-------|
| `NULL = ANY([])` | 0 | N | F |
| `NULL <> ALL([])` | 0 | N | T |
* For any `null` values in the haystack, PostgreSQL will always mark the
expression as `null`, whereas we diverge slightly since we desugar to existing
functions which filter out nulls:
| Expression | This PR | PostgreSQL | Why
|
|-------------------------------|--------|-------|------------------------------------------------------------|
| `5 = ANY([NULL, NULL])` | **F** | N | `array_has([NULL,NULL],
5)` = FALSE. PG: `5=N OR 5=N` = N. |
| `5 = ANY([3, NULL])` | **F** | N | `array_has([3,NULL], 5)`
= FALSE. PG: `5=3 OR 5=N` = `F OR N` = N. |
| `5 <> ALL([NULL, NULL])` | **T** | N | `NOT array_has(…)` = NOT
FALSE = TRUE. PG: `5<>N AND 5<>N` = N. |
| `5 <> ALL([3, NULL])` | **T** | N | Same shape.
|
| `5 > ALL([3, NULL])` | **T** | N | `array_max([3,NULL])` =
3, so `5 > 3` = TRUE. PG: `5>3 AND 5>N` = `T AND N` = N. |
| `5 < ANY([3, NULL])` | **F** | N | `array_max([3,NULL])` =
3, so `5 < 3` = FALSE. PG: `5<3 OR 5<N` = `F OR N` = N. |
| `5 >= ALL([3, NULL])` | **T** | N | Same pattern as `>`.
|
| `5 = ALL([5, NULL])` | **T** | N | min=max=5, so both `5=5`.
PG: `5=5 AND 5=N` = `T AND N` = N. (Subtle — needle equals the only non-NULL.) |
| `5 <> ANY([5, NULL])` | **F** | N | min=max=5, so both `5<>5`
= FALSE. PG: `5<>5 OR 5<>N` = `F OR N` = N. |
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]