andygrove opened a new issue, #4502: URL: https://github.com/apache/datafusion-comet/issues/4502
Tracking issue for follow-up work surfaced by the date/time expression audit in #4448. Each item below is either a support-level / serde consistency fix that the audit deliberately deferred (because it needs a semantics decision rather than a mechanical edit), or a divergence the audit documented but did not file separately. Findings already covered by an issue filed during the PR (`#4449` ANSI `next_day`, `#4450` `next_day` trim, `#4451` ANSI `make_date`, `#3180` `Hour`/`Minute`/`Second` TimestampNTZ, `#2013` legacy timezone forms, `#2649` non-UTC `date_trunc`, `#16594` / `#16577` `from_unixtime` upstream) are intentionally excluded. ## High priority ### 1. `CometUnixTimestamp`: `getUnsupportedReasons` and `isSupportedInputType` disagree on `TimestampNTZType` `spark/src/main/scala/org/apache/comet/serde/datetime.scala:309-362`. `getUnsupportedReasons` claims `TimestampNTZType is not supported because Comet incorrectly applies timezone conversion to TimestampNTZ values`, but `isSupportedInputType` returns `true` for `TimestampNTZType` and `getSupportLevel` reports `Compatible()` for it. Either the predicate or the reason is wrong. Per the audit-comet-expression skill, the EXPLAIN-time reason must match the runtime behaviour: if the native path really does apply session-timezone conversion to `TIMESTAMP_NTZ` (as the documented divergence on `unix_timestamp` in `spark_expressions_support.md` suggests), the support level for that input type should be `Incompatible(Some(...))` and cross-reference #3180; if it does not, the reason text should be removed. ### 2. `CometDateFormat`: gate the `Compatible` decision in `getSupportLevel`, not `convert` `spark/src/main/scala/org/apache/comet/serde/datetime.scala:609-700`. `getSupportLevel` returns `Compatible()` unconditionally, while `convert` reads `CometConf.isExprAllowIncompat(...)` and branches on UTC vs non-UTC session timezone, on whether the format is a whitelisted literal, and on whether the codegen dispatcher is enabled. The audit-comet-expression skill (rule 10) requires expression-shape restrictions to be declared in `getSupportLevel` so EXPLAIN, the auto-generated compatibility doc, and the dispatcher can all see them. Moving this gating changes the dispatch flow slightly (non-UTC + whitelisted format becomes `Incompatible` rather than silently routed through the codegen dispatcher) and needs a separate review. ### 3. Group A codegen-dispatched datetime expressions report `Compatible` while silently falling back when `spark.comet.exec.scalaUDF.codegen.enabled=false` For the codegen-dispatched datetime expressions (notably `to_unix_timestamp` and `date_format` per the documented divergence in `docs/source/contributor-guide/spark_expressions_support.md`), `getSupportLevel` returns `Compatible()` while `convert` returns `None` when the Scala UDF codegen dispatcher is disabled. The dispatcher flag is not surfaced in EXPLAIN or in the auto-generated compatibility guide, so users see `Compatible` while the operator actually falls back to Spark. Per the audit skill, this should either be reported as `Incompatible(Some(\"requires spark.comet.exec.scalaUDF.codegen.enabled=true\"))` or the dispatcher requirement should be surfaced in `getCompatibleNotes`. ### 4. `CometMakeTimestamp` does not honour `spark.sql.ansi.enabled` `MakeTimestamp` honours `spark.sql.ansi.enabled` (throws on invalid `(year, month, day, hour, min, sec[, timezone])`, else NULL) in Spark 3.4 / 3.5 / 4.0. The Comet wiring does not check `failOnError`, so under ANSI mode Comet returns NULL where Spark would throw `ansiDateTimeArgumentOutOfRange` / `invalidFractionOfSecondError`. Same shape as the `make_date` ANSI divergence captured in #4451 / `make_date_ansi.sql`. Capture an analogous `make_timestamp_ansi.sql` regression test once a fix lands, and gate the support level appropriately in the meantime. ## Medium priority ### 5. Spark 4.0 `StringTypeWithCollation` not gated for datetime expressions The audit documents (in `docs/source/contributor-guide/spark_expressions_support.md`) that Spark 4.0 widens many string-typed `inputTypes` on datetime expressions (`convert_timezone`, `date_format`, `date_trunc`, `from_unixtime`, `make_timestamp`, `next_day`, `to_unix_timestamp`, `trunc`, `unix_timestamp`) to `StringTypeWithCollation(supportsTrimCollation = true)`. Per the audit skill (rule 11), non-default collations should flip the support level to `Incompatible(Some(...))`. Today the serdes accept those inputs without distinguishing collation, so non-default collations are silently treated as compatible. Cross-references #2190 and #4496. ### 6. `CometDateDiff` non-wrapping `i32 -` panics in debug builds on extreme inputs Documented divergence (line 30 of the PR diff): the native `date_diff` impl subtracts with `i32 -`, which Rust panics on in debug builds when the result overflows; Spark wraps. Practically unreachable for `DateType` inputs (the valid range is far smaller than `i32::MAX`), but it is an unmasked panic path. Either switch to `wrapping_sub` to match Spark, or document that release builds wrap (they do) and add a debug-assert pre-check. Surfaced by the `audit-comet-expression` skill run in #4448. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
