andygrove opened a new issue, #4502:
URL: https://github.com/apache/datafusion-comet/issues/4502

   Tracking issue for follow-up work surfaced by the date/time expression audit 
in #4448. Each item below is either a support-level / serde consistency fix 
that the audit deliberately deferred (because it needs a semantics decision 
rather than a mechanical edit), or a divergence the audit documented but did 
not file separately. Findings already covered by an issue filed during the PR 
(`#4449` ANSI `next_day`, `#4450` `next_day` trim, `#4451` ANSI `make_date`, 
`#3180` `Hour`/`Minute`/`Second` TimestampNTZ, `#2013` legacy timezone forms, 
`#2649` non-UTC `date_trunc`, `#16594` / `#16577` `from_unixtime` upstream) are 
intentionally excluded.
   
   ## High priority
   
   ### 1. `CometUnixTimestamp`: `getUnsupportedReasons` and 
`isSupportedInputType` disagree on `TimestampNTZType`
   
   `spark/src/main/scala/org/apache/comet/serde/datetime.scala:309-362`. 
`getUnsupportedReasons` claims `TimestampNTZType is not supported because Comet 
incorrectly applies timezone conversion to TimestampNTZ values`, but 
`isSupportedInputType` returns `true` for `TimestampNTZType` and 
`getSupportLevel` reports `Compatible()` for it. Either the predicate or the 
reason is wrong. Per the audit-comet-expression skill, the EXPLAIN-time reason 
must match the runtime behaviour: if the native path really does apply 
session-timezone conversion to `TIMESTAMP_NTZ` (as the documented divergence on 
`unix_timestamp` in `spark_expressions_support.md` suggests), the support level 
for that input type should be `Incompatible(Some(...))` and cross-reference 
#3180; if it does not, the reason text should be removed.
   
   ### 2. `CometDateFormat`: gate the `Compatible` decision in 
`getSupportLevel`, not `convert`
   
   `spark/src/main/scala/org/apache/comet/serde/datetime.scala:609-700`. 
`getSupportLevel` returns `Compatible()` unconditionally, while `convert` reads 
`CometConf.isExprAllowIncompat(...)` and branches on UTC vs non-UTC session 
timezone, on whether the format is a whitelisted literal, and on whether the 
codegen dispatcher is enabled. The audit-comet-expression skill (rule 10) 
requires expression-shape restrictions to be declared in `getSupportLevel` so 
EXPLAIN, the auto-generated compatibility doc, and the dispatcher can all see 
them. Moving this gating changes the dispatch flow slightly (non-UTC + 
whitelisted format becomes `Incompatible` rather than silently routed through 
the codegen dispatcher) and needs a separate review.
   
   ### 3. Group A codegen-dispatched datetime expressions report `Compatible` 
while silently falling back when 
`spark.comet.exec.scalaUDF.codegen.enabled=false`
   
   For the codegen-dispatched datetime expressions (notably `to_unix_timestamp` 
and `date_format` per the documented divergence in 
`docs/source/contributor-guide/spark_expressions_support.md`), 
`getSupportLevel` returns `Compatible()` while `convert` returns `None` when 
the Scala UDF codegen dispatcher is disabled. The dispatcher flag is not 
surfaced in EXPLAIN or in the auto-generated compatibility guide, so users see 
`Compatible` while the operator actually falls back to Spark. Per the audit 
skill, this should either be reported as `Incompatible(Some(\"requires 
spark.comet.exec.scalaUDF.codegen.enabled=true\"))` or the dispatcher 
requirement should be surfaced in `getCompatibleNotes`.
   
   ### 4. `CometMakeTimestamp` does not honour `spark.sql.ansi.enabled`
   
   `MakeTimestamp` honours `spark.sql.ansi.enabled` (throws on invalid `(year, 
month, day, hour, min, sec[, timezone])`, else NULL) in Spark 3.4 / 3.5 / 4.0. 
The Comet wiring does not check `failOnError`, so under ANSI mode Comet returns 
NULL where Spark would throw `ansiDateTimeArgumentOutOfRange` / 
`invalidFractionOfSecondError`. Same shape as the `make_date` ANSI divergence 
captured in #4451 / `make_date_ansi.sql`. Capture an analogous 
`make_timestamp_ansi.sql` regression test once a fix lands, and gate the 
support level appropriately in the meantime.
   
   ## Medium priority
   
   ### 5. Spark 4.0 `StringTypeWithCollation` not gated for datetime expressions
   
   The audit documents (in 
`docs/source/contributor-guide/spark_expressions_support.md`) that Spark 4.0 
widens many string-typed `inputTypes` on datetime expressions 
(`convert_timezone`, `date_format`, `date_trunc`, `from_unixtime`, 
`make_timestamp`, `next_day`, `to_unix_timestamp`, `trunc`, `unix_timestamp`) 
to `StringTypeWithCollation(supportsTrimCollation = true)`. Per the audit skill 
(rule 11), non-default collations should flip the support level to 
`Incompatible(Some(...))`. Today the serdes accept those inputs without 
distinguishing collation, so non-default collations are silently treated as 
compatible. Cross-references #2190 and #4496.
   
   ### 6. `CometDateDiff` non-wrapping `i32 -` panics in debug builds on 
extreme inputs
   
   Documented divergence (line 30 of the PR diff): the native `date_diff` impl 
subtracts with `i32 -`, which Rust panics on in debug builds when the result 
overflows; Spark wraps. Practically unreachable for `DateType` inputs (the 
valid range is far smaller than `i32::MAX`), but it is an unmasked panic path. 
Either switch to `wrapping_sub` to match Spark, or document that release builds 
wrap (they do) and add a debug-assert pre-check.
   
   Surfaced by the `audit-comet-expression` skill run in #4448.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to