sweb opened a new pull request, #22628: URL: https://github.com/apache/datafusion/pull/22628
## Which issue does this PR close?

- Closes #22490.

## Rationale for this change

Per IEEE 754 default semantics, `-0.0 == +0.0` (and `-0.0 < +0.0` is false). PostgreSQL, DuckDB, and Python all follow this. DataFusion currently treats `-0.0` as strictly less than `+0.0` because arrow-rs' comparison kernels intentionally use totalOrder semantics. This produces surprising results in `WHERE` filters, `IN` lists, and `IS [NOT] DISTINCT FROM`, especially when `-0.0` is produced by arithmetic on a column (e.g. `x * -1` where `x = 0.0`).

See also https://github.com/apache/arrow-rs/blob/58.3.0/arrow-ord/src/cmp.rs#L66-L80

This was debugged, replicated, and further explored using Claude Code. The result was subsequently adjusted and further improved.

## What changes are included in this PR?

* Add `normalize_neg_zero` / `normalize_neg_zero_array` / `normalize_neg_zero_scalar` in `datafusion-physical-expr-common::datum`. These rewrite `-0.0` to `+0.0` for float inputs and pass arrays through unchanged (no allocation) when no `-0.0` is present.
* Apply the normalization in `apply_cmp` so all comparison operators (`=`, `<>`, `<`, `<=`, `>`, `>=`, distinct / not-distinct, like / ilike) inherit IEEE 754 zero semantics.
* Apply it in `InListExpr` for both the dynamic comparator path and the per-list-expression normalization. For the primitive static filter (which hashes via `OrderedFloat`), inserting `0.0` now also inserts `-0.0` (and vice versa) so set membership matches the normalized comparison semantics.

## Are these changes tested?

Yes:

* Unit tests in `datum.rs` cover float normalization, including dictionaries, and check that inputs are passed through unchanged when no `-0.0` is present.
* A new sqllogictest covers this particular case. IEEE 754 semantics may also be problematic in other places that this PR does not touch.

## Are there any user-facing changes?

Yes, comparisons against `-0.0` change with this.
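
To make the semantic difference concrete, here is a minimal standalone sketch using only the Rust standard library. It is not the code from this PR: it works on plain `f64` values rather than Arrow arrays, it keys the set on raw `u64` bit patterns instead of `OrderedFloat`, and the names `normalize_zero`, `total_order_eq`, `normalized_eq`, and `build_in_set` are hypothetical helpers introduced only for illustration.

```rust
use std::cmp::Ordering;
use std::collections::HashSet;

/// Hypothetical helper: map -0.0 to +0.0 and leave every other value alone.
/// `v == 0.0` is true for both zeros under IEEE 754 and false for NaN.
fn normalize_zero(v: f64) -> f64 {
    if v == 0.0 { 0.0 } else { v }
}

/// Equality under totalOrder semantics, where -0.0 sorts before +0.0.
fn total_order_eq(a: f64, b: f64) -> bool {
    a.total_cmp(&b) == Ordering::Equal
}

/// The same totalOrder comparison, applied after normalizing both sides.
fn normalized_eq(a: f64, b: f64) -> bool {
    normalize_zero(a).total_cmp(&normalize_zero(b)) == Ordering::Equal
}

/// Build a set of bit patterns for an IN list. The two zeros have different
/// bit patterns, so whenever either zero appears, both are inserted.
fn build_in_set(values: &[f64]) -> HashSet<u64> {
    let mut set = HashSet::new();
    for &v in values {
        if v == 0.0 {
            set.insert(0.0_f64.to_bits());
            set.insert((-0.0_f64).to_bits());
        } else {
            set.insert(v.to_bits());
        }
    }
    set
}
```

With `let x = 0.0_f64 * -1.0;` (which is `-0.0`), `total_order_eq(x, 0.0)` is false while `normalized_eq(x, 0.0)` is true, and `build_in_set(&[0.0, 1.5])` contains `x.to_bits()`. Inserting both zeros keeps a bit-pattern-based lookup consistent with the normalized comparison without having to normalize every probe value.
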
Since this introduces an extra step for comparisons, it also has performance implications. I tried to reduce the cost by checking for float inputs first, which reduces the performance hit, but my benchmark results were not very consistent.

Please check whether IEEE 754 behavior for `-0.0` is desirable for DataFusion and whether this line of implementation fits.
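
The "pass through unchanged when no `-0.0` is present" idea behind the cost reduction can be sketched as follows. This is an assumption-laden illustration, not the PR's implementation: it uses a plain `&[f64]` slice and `std::borrow::Cow` in place of Arrow arrays, and `normalize_neg_zero_slice` is a hypothetical name.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for array-level normalization: borrow the input
/// when it contains no -0.0, and allocate a fixed copy only when it does.
fn normalize_neg_zero_slice(values: &[f64]) -> Cow<'_, [f64]> {
    // Scan first: a value is -0.0 iff it compares equal to zero
    // and has its sign bit set.
    let has_neg_zero = values
        .iter()
        .any(|v| *v == 0.0 && v.is_sign_negative());
    if !has_neg_zero {
        // Common case: nothing to rewrite, so no allocation happens.
        return Cow::Borrowed(values);
    }
    // Rare case: copy the data, replacing either zero with +0.0.
    let fixed: Vec<f64> = values
        .iter()
        .map(|&v| if v == 0.0 { 0.0 } else { v })
        .collect();
    Cow::Owned(fixed)
}
```

The scan still touches every value, so the extra step is not free; what it avoids in the common case is the allocation and the copy.
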
