sweb opened a new pull request, #22628:
URL: https://github.com/apache/datafusion/pull/22628

   ## Which issue does this PR close?
   
   - Closes #22490.
   
   ## Rationale for this change
   
   Per IEEE 754 default semantics, `-0.0 == +0.0` (and `-0.0 < +0.0` is false). 
PostgreSQL, DuckDB, and Python all follow this. DataFusion currently treats 
`-0.0` as strictly less than `+0.0` because arrow-rs' comparison kernels 
intentionally use totalOrder semantics. This produces surprising results in 
`WHERE` filters, `IN` lists, and `IS [NOT] DISTINCT FROM`, especially when 
`-0.0` is produced by arithmetic on a column (e.g. `x * -1` where `x = 0.0`).
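
   As a small illustration of the two orderings described above, the sketch below uses only the Rust standard library (not arrow-rs or DataFusion). The helper names `ieee_cmp`, `total_order_cmp`, and `negate` are hypothetical and exist only for this example.

```rust
use std::cmp::Ordering;

/// IEEE 754 comparison, the same semantics as Rust's `==` and `<` on floats.
/// Returns `None` only when one of the operands is NaN.
fn ieee_cmp(a: f64, b: f64) -> Option<Ordering> {
    a.partial_cmp(&b)
}

/// totalOrder comparison, which ranks -0.0 strictly below +0.0.
fn total_order_cmp(a: f64, b: f64) -> Ordering {
    a.total_cmp(&b)
}

/// Mirrors the motivating case: `x * -1` with `x = 0.0` produces -0.0.
fn negate(x: f64) -> f64 {
    x * -1.0
}
```

   With these helpers, `ieee_cmp(negate(0.0), 0.0)` reports the values as equal, while `total_order_cmp(negate(0.0), 0.0)` reports `Less`, which is the discrepancy this PR addresses.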
   
   See also 
https://github.com/apache/arrow-rs/blob/58.3.0/arrow-ord/src/cmp.rs#L66-L80
   
   This was debugged, reproduced, and further explored using Claude Code. 
The resulting changes were then adjusted and improved.
   
   ## What changes are included in this PR?
   
   * Add `normalize_neg_zero` / `normalize_neg_zero_array` /  
`normalize_neg_zero_scalar` in `datafusion-physical-expr-common::datum`. These 
rewrite `-0.0` to `+0.0` for float inputs and pass arrays through unchanged (no 
allocation) when no `-0.0` is present.
   * Apply the normalization in `apply_cmp` so all comparison operators (`=`, 
`<>`, `<`, `<=`, `>`, `>=`, distinct / not-distinct, like / ilike) inherit IEEE 
754 zero semantics.
   * Apply it in `InListExpr` for both the dynamic comparator path and the 
per-list-expression normalization. For the primitive static filter (which 
hashes via `OrderedFloat`), inserting `0.0` now also inserts `-0.0` (and vice 
versa) so set membership matches the normalized comparison semantics.
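
   The idea behind these changes can be sketched on plain slices using only the standard library. This is not the PR's code: `normalize_zero`, `normalize_slice`, and `zero_aware_set` are hypothetical names, `Cow` stands in for passing an Arrow array through unchanged, and bit-pattern keys stand in for the `OrderedFloat`-based hashing, whose exact behavior is not shown here.

```rust
use std::borrow::Cow;
use std::collections::HashSet;

/// Rewrites -0.0 to +0.0 and leaves every other value (including NaN) untouched.
fn normalize_zero(v: f64) -> f64 {
    if v == 0.0 { 0.0 } else { v }
}

/// Returns the input borrowed when it holds no -0.0, so the common case
/// does not allocate; otherwise returns a normalized copy.
fn normalize_slice(values: &[f64]) -> Cow<'_, [f64]> {
    let has_neg_zero = values.iter().any(|v| *v == 0.0 && v.is_sign_negative());
    if has_neg_zero {
        Cow::Owned(values.iter().map(|v| normalize_zero(*v)).collect())
    } else {
        Cow::Borrowed(values)
    }
}

/// Builds a membership set keyed by bit pattern (an assumption for this sketch).
/// Inserting either zero also inserts the other, so lookups agree with the
/// normalized comparison semantics.
fn zero_aware_set(list: &[f64]) -> HashSet<u64> {
    let mut set = HashSet::new();
    for v in list {
        set.insert(v.to_bits());
        if *v == 0.0 {
            set.insert(0.0_f64.to_bits());
            set.insert((-0.0_f64).to_bits());
        }
    }
    set
}
```

   Scanning before copying costs one extra pass over the data when a `-0.0` is present, but it keeps inputs without `-0.0` allocation-free, which is the trade-off the first bullet describes.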
   
   ## Are these changes tested?
   
   Yes:
   * Unit tests in `datum.rs` cover float normalization, passthrough when no 
`-0.0` is present, and dictionaries.
   * A new sqllogictest covers this particular case. IEEE 754 semantics may 
also be an issue in other places that this PR does not touch.
   
   ## Are there any user-facing changes?
   
   Yes. Comparisons involving `-0.0` now treat it as equal to `+0.0`.
   
   Because this adds an extra step to comparisons, it also has performance 
implications. Checking for a float type first reduces the overhead, but my 
benchmark results were not very consistent. Please check whether IEEE 754 
behavior for `-0.0` is desirable for DataFusion and whether this 
implementation approach fits.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

