andygrove opened a new issue, #4756:
URL: https://github.com/apache/datafusion-comet/issues/4756

   Spun out from item #4 of #4500 to defer until the in-flight interval-support 
PRs land.
   
   ## Background
   
   `Greatest` / `Least` are registered as bare 
`CometScalarFunction("greatest")` / `CometScalarFunction("least")` with no 
input-type gating. The #4500 audit flagged this as a potential runtime-error 
risk for types the native UDF cannot handle (intervals, Spark-specific 
orderings).
   
   ## Investigation (current `main`)
   
   Probed `greatest` / `least` on Spark 3.5 and 4.0 across a wide range of 
input types, comparing Comet output against Spark and recording whether the 
projection ran natively:
   
   | Input type | Runs natively | Matches Spark |
   | --- | --- | --- |
   | int / mixed numeric | yes | yes |
   | double incl. `NaN` / `-0.0` / `Infinity` | yes | yes |
   | decimal (mixed precision/scale) | yes | yes |
   | string (default collation) | yes | yes |
   | date / timestamp | yes | yes |
   | binary | yes | yes |
   | array (incl. with-null, nested, array-of-struct) | yes | yes |
   | struct (incl. with-null) | yes | yes |
   | null-only | yes | yes |
   | day-time / year-month interval | no (falls back) | yes |
   | collated string (Spark 4.x, e.g. `UTF8_LCASE`) | no (falls back) | yes |
   
   **No live defect today.** Unsupported types (intervals, collated strings) 
fail serialization and fall back to Spark cleanly: no runtime error, no silent 
divergence. The supported types run natively and match Spark, including NaN 
ordering and nested/null cases.
   
   Note: the issue's suggested fix (gate to scalar sort element types via 
`supportedScalarSortElementType`) would be a **net regression**, since it would 
disable the array / struct / decimal cases that currently work natively and 
correctly.
   
   ## Why this is deferred
   
   There are pending PRs adding native interval support. Once intervals 
serialize, `greatest` / `least` over interval columns will start reaching the 
native UDF instead of falling back, so the correctness / gating story must be 
re-evaluated against arrow's comparator behavior for `Interval` / `Duration` 
types at that point.
   
   ## Deferred work
   
   - [ ] After interval support lands, re-verify `greatest` / `least` over 
`DayTimeIntervalType` / `YearMonthIntervalType` (correctness vs Spark; add a 
`getSupportLevel` gate only if a real divergence/error appears).
   - [ ] Add `CometExpressionSuite` regression tests locking in the verified 
native behavior (numerics/NaN, decimal, string, date, timestamp, binary, arrays 
incl. nested/with-null, structs incl. with-null, null-only) plus the graceful 
interval / collated-string fallback.
   - [ ] Revisit collated-string handling under the collation umbrella (#4496) 
if collated `greatest` / `least` ever becomes natively supported.
   
   Surfaced by the `audit-comet-expression` skill run in #4486.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to