[PR] docs: refresh expression audit notes for resolved correctness issues [datafusion-comet]

via GitHub Mon, 29 Jun 2026 13:49:19 -0700


andygrove opened a new pull request, #4762:
URL: https://github.com/apache/datafusion-comet/pull/4762


   ## Which issue does this PR close?
   
   Documentation refresh, not tied to a single issue. The notes updated here 
correspond to already-closed issues #4463, #4464, #4465, #4481, #4482, #4554, 
and #4681.
   
   ## Rationale for this change
   
   A sweep of `docs/source/contributor-guide/expression-audits` found that 
several pages still describe correctness gaps as live, even though the 
underlying issues have been fixed and merged. Audit pages are meant to describe 
current behavior, so a reader (or the `audit-comet-expression` skill) would 
otherwise be misled into thinking these expressions still diverge from Spark on 
the default config.
   
   ## What changes are included in this PR?
   
   Updated audit notes to reflect the current code:
   
   - `array_contains` / `array_distinct` / `array_except` / `array_max` / 
`array_min` / `array_union`: float/double arrays containing NaN and signed zero 
now match Spark, because DataFusion canonicalizes those values. The stale 
"Known divergence" notes (#4481, #4482) are replaced with the current behavior. 
The `array_union` ordering caveat is also resolved (#4681).
   - `array_intersect`: now reports `Incompatible` with a codegen-dispatch 
fallback, so it is Spark-correct by default; the native path (different element 
ordering) is used only when incompatible expressions are explicitly allowed.
   - `bit_length` / `octet_length`: `BinaryType` input is now reported 
`Unsupported` and falls back to Spark cleanly instead of failing in native 
execution (#4464).
   - `translate`: now reports `Incompatible` (graphemes vs code points, U+0000 
handling) and falls back by default (#4463).
   - `decode`: now routed through the codegen dispatcher on all versions, 
honouring the `charset` argument and the Spark 4.0 `legacyCharsets` / 
`legacyErrorAction` flags (#4465).
   - `try_make_timestamp`: now routed through the codegen dispatcher, returning 
NULL for invalid inputs to match Spark (#4554).
   - `from_utc_timestamp` / `to_utc_timestamp`: now report `Incompatible` with 
a codegen-dispatch fallback, so Spark's legacy zone forms (`GMT+1`, `UTC+1`, 
`PST`) are Spark-correct by default; the native parser (IANA / `+HH:MM` only) 
is used only under the opt-in path (#2013).
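
To illustrate what canonicalizing NaN and signed zero means for the array functions in the first bullet, here is a minimal Python sketch. It is not DataFusion's or Comet's code. The helper names (`bits`, `canonicalize`, `array_distinct`) are hypothetical, and it assumes the semantics described above, where all NaNs count as one value and `-0.0` counts as equal to `0.0`.

```python
import math
import struct


def bits(x: float) -> bytes:
    """Raw IEEE 754 big-endian bytes of a double (hypothetical helper)."""
    return struct.pack(">d", x)


def canonicalize(x: float) -> float:
    """Collapse every NaN to one NaN and -0.0 to 0.0 (illustrative only)."""
    if math.isnan(x):
        return math.nan
    if x == 0.0:  # true for both 0.0 and -0.0
        return 0.0
    return x


def array_distinct(values):
    """Deduplicate doubles by the bit pattern of their canonical form."""
    seen = set()
    out = []
    for v in values:
        c = canonicalize(v)
        key = bits(c)
        if key not in seen:
            seen.add(key)
            out.append(c)
    return out


# A quiet NaN with a non-default payload, built from its bit pattern.
other_nan = struct.unpack(">d", bytes.fromhex("7ff8000000000001"))[0]
result = array_distinct([0.0, -0.0, math.nan, other_nan, 1.0])
```

Comparing raw bit patterns without the canonicalization step would keep `0.0` and `-0.0` as separate elements, which is the kind of divergence the stale notes described.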
   
   ## How are these changes tested?
   
   Documentation-only change. Each updated note was verified against the 
current serde / native implementation (`getSupportLevel` gating, 
codegen-dispatch wiring, and the DataFusion canonicalization behavior).
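
The gating described in the notes above can be sketched roughly as follows. This is a simplified Python model, not Comet's Scala serde code. Only `getSupportLevel` and the level names `Incompatible` and `Unsupported` come from this PR; `Compatible`, `choose_path`, `allow_incompatible`, `has_codegen_dispatch`, and the returned path labels are hypothetical names used for illustration.

```python
from enum import Enum


class SupportLevel(Enum):
    # COMPATIBLE is an assumed name for "native result matches Spark".
    COMPATIBLE = "compatible"
    INCOMPATIBLE = "incompatible"
    UNSUPPORTED = "unsupported"


def choose_path(level: SupportLevel,
                allow_incompatible: bool = False,
                has_codegen_dispatch: bool = False) -> str:
    """Pick an execution path for one expression (hypothetical model)."""
    if level is SupportLevel.COMPATIBLE:
        return "native"
    if level is SupportLevel.INCOMPATIBLE:
        if allow_incompatible:
            # Opt-in path: the user explicitly accepts differing results.
            return "native"
        if has_codegen_dispatch:
            # Default path for expressions wired to the codegen dispatcher.
            return "codegen_dispatch"
    # Unsupported inputs, or Incompatible without a dispatcher, fall back.
    return "spark_fallback"


# Example: an Incompatible expression with a dispatcher, default config.
default_path = choose_path(SupportLevel.INCOMPATIBLE, has_codegen_dispatch=True)
opt_in_path = choose_path(SupportLevel.INCOMPATIBLE, allow_incompatible=True,
                          has_codegen_dispatch=True)
```

Under this model the default configuration never selects the native path for an `Incompatible` expression, which is why the updated pages describe these expressions as Spark-correct by default.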


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

