andygrove opened a new issue, #4756:
URL: https://github.com/apache/datafusion-comet/issues/4756
Spun out from item #4 of #4500 to defer until the in-flight interval-support
PRs land.
## Background
`Greatest` / `Least` are registered as bare
`CometScalarFunction("greatest")` / `CometScalarFunction("least")` with no
input-type gating. The #4500 audit flagged this as a potential runtime-error
risk for types the native UDF cannot handle (intervals, Spark-specific
orderings).
## Investigation (current `main`)
Probed `greatest` / `least` on Spark 3.5 and 4.0 across a wide range of
input types, comparing Comet output against Spark and recording whether the
projection ran natively:
| Input type | Runs natively | Matches Spark |
| --- | --- | --- |
| int / mixed numeric | yes | yes |
| double incl. `NaN` / `-0.0` / `Infinity` | yes | yes |
| decimal (mixed precision/scale) | yes | yes |
| string (default collation) | yes | yes |
| date / timestamp | yes | yes |
| binary | yes | yes |
| array (incl. with-null, nested, array-of-struct) | yes | yes |
| struct (incl. with-null) | yes | yes |
| null-only | yes | yes |
| day-time / year-month interval | no (falls back) | yes |
| collated string (Spark 4.x, e.g. `UTF8_LCASE`) | no (falls back) | yes |
**No live defect today.** Unsupported types (intervals, collated strings)
fail serialization and fall back to Spark cleanly: no runtime error, no silent
divergence. The supported types run natively and match Spark, including NaN
ordering and nested/null cases.
Note: the issue's suggested fix (gate to scalar sort element types via
`supportedScalarSortElementType`) would be a **net regression**, since it would
disable the array / struct / decimal cases that currently work natively and
correctly.
## Why this is deferred
There are pending PRs adding native interval support. Once intervals
serialize, `greatest` / `least` over interval columns will start reaching the
native UDF instead of falling back, so the correctness / gating story must be
re-evaluated against arrow's comparator behavior for `Interval` / `Duration`
types at that point.
## Deferred work
- [ ] After interval support lands, re-verify `greatest` / `least` over
`DayTimeIntervalType` / `YearMonthIntervalType` (correctness vs Spark; add a
`getSupportLevel` gate only if a real divergence/error appears).
- [ ] Add `CometExpressionSuite` regression tests locking in the verified
native behavior (numerics/NaN, decimal, string, date, timestamp, binary, arrays
incl. nested/with-null, structs incl. with-null, null-only) plus the graceful
interval / collated-string fallback.
- [ ] Revisit collated-string handling under the collation umbrella (#4496)
if collated `greatest` / `least` ever becomes natively supported.
Surfaced by the `audit-comet-expression` skill run in #4486.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]